
AI Crawler Checker

Check whether GPTBot, ClaudeBot, PerplexityBot, Google-Extended and 9 other AI crawlers can access any URL.

We fetch /robots.txt on the same host and apply the matching rules.
Quick answer
The AI Crawler Checker reads your live robots.txt and tells you whether 13 of the most important AI crawlers - GPTBot, ClaudeBot, PerplexityBot, Google-Extended and 9 others - are allowed to fetch any URL you supply. It applies RFC 9309 longest-match rules, surfaces the exact directive that allowed or blocked each bot, and is free with no signup.
Key takeaways
  • AI crawlers obey robots.txt. If you block them, you disappear from the AI engines they feed.
  • There are two kinds of AI crawlers: training bots (one-time data harvest) and user bots (live-fetch during a conversation).
  • Blocking training bots is defensible. Blocking user bots is almost always a self-inflicted wound.
  • Edge-level blocks (Cloudflare, WAFs) bypass robots.txt - if your robots.txt is open but bots are still blocked, look there.
  • A monthly check on your top landing pages catches most accidental regressions.

What is an AI crawler, exactly?

An AI crawler is an automated agent that fetches webpages on behalf of a large language model. Some crawlers harvest text to train a model (GPTBot, ClaudeBot, Google-Extended, CCBot, Bytespider). Others fetch live during a conversation to answer a user question with up-to-date data (ChatGPT-User, Perplexity-User, Claude-Web). Both types matter, and they obey the same robots.txt rules - if you block them, you disappear from the model.

Why this check matters for AI visibility

Sites accidentally block AI crawlers all the time. A leftover Disallow: / from an archive cleanup, a Cloudflare bot-fight rule, an over-eager security plugin - any of these can silently strip your content from the next-generation answer engines that route an increasing share of buying-intent traffic.

This tool reads your live robots.txt, applies the longest-match rule per user-agent (the resolution order RFC 9309 defines), and tells you whether each of the 13 crawlers we track is allowed to fetch the URL you supplied. If a crawler is blocked, the rule that did it is shown so you can fix it in seconds.

“The two most common AI visibility regressions we see are not strategic - they're accidental. A WordPress security plugin gets aggressive, a CDN bot-fight rule gets enabled, or a junior dev pastes a generic 'block all bots' robots.txt off Stack Overflow. Three months later the team wonders why ChatGPT suddenly stopped recommending them.”
- Nik Sov, Founder, Livesov

The 13 crawlers this tool checks

Each crawler below has its own user-agent string and its own purpose. Robots.txt rules target them by name, so you can allow some and block others. We grade each independently.

Crawler | Vendor | Purpose
GPTBot | OpenAI | Trains ChatGPT and GPT-series models. The most consequential bot to allow.
OAI-SearchBot | OpenAI | Powers ChatGPT Search results - a separate index from the training corpus.
ChatGPT-User | OpenAI | Live-fetches pages during a conversation when ChatGPT browses on a user's behalf.
ClaudeBot | Anthropic | Trains Claude models. Trusted, well-documented, respects all standard directives.
Claude-Web | Anthropic | Fetches pages live when Claude needs to cite a source.
PerplexityBot | Perplexity | Indexes pages for Perplexity answers. A citation-heavy product.
Perplexity-User | Perplexity | Live-fetches pages cited in answers. Blocking it kills your Perplexity citations.
Google-Extended | Google | Opt-out for Gemini training and Google Vertex AI. Does NOT affect Google search.
GoogleOther | Google | Catch-all for Google R&D crawls outside core search.
CCBot | Common Crawl | Builds the public corpus most open-weight models train on.
Bytespider | ByteDance | Trains ByteDance / TikTok AI models. Known to be aggressive.
Meta-ExternalAgent | Meta | Trains Llama models. Respects robots.txt as of 2024.
Applebot-Extended | Apple | Trains Apple Intelligence models. A recently introduced opt-out.

Training bots vs user bots

The single most useful framing when deciding what to block: training bots harvest content once to feed model pre-training; user bots fetch a page live during a conversation, on behalf of a real user, to ground the answer. Blocking training bots removes you from future model versions but does not affect today's recommendations. Blocking user bots removes you from every live citation the moment the rule deploys.

Live-fetch user bots that you almost never want to block

  • ChatGPT-User
  • Claude-Web
  • Perplexity-User
  • OAI-SearchBot

These four bots are the difference between "ChatGPT can recommend me" and "ChatGPT cannot see me". Treat them as you would Googlebot.

How to fix a blocked URL

  1. Open your site's robots.txt at https://yourdomain.com/robots.txt.
  2. Find the user-agent group that matches the blocked bot - or the wildcard User-agent: * group.
  3. Replace Disallow: / with Allow: / (or remove the disallow line entirely) for the paths you want indexed.
  4. If you want to block training but allow live fetches, target only the training bots (GPTBot, ClaudeBot, Google-Extended, CCBot) and leave the *-User bots open.
  5. Re-run this checker to confirm the change.

A copy-paste robots.txt for "allow everything reasonable"

The simplest stance: allow every documented AI crawler, block only the obviously aggressive ones, and rely on edge rules for emergencies.

User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

# Block aggressive scrapers
User-agent: Bytespider
Disallow: /

A robots.txt for "allow live citation, opt out of training"

The middle-ground stance: get cited live by ChatGPT, Claude and Perplexity, but opt out of being absorbed into the next pre-training run.

User-agent: *
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml

# Opt out of training crawlers
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Keep live-fetch bots open
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-Web
Allow: /

User-agent: Perplexity-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

Tip: blocking training bots and allowing user bots is a defensible middle ground. It lets ChatGPT and Perplexity cite you in real time without your content being absorbed into the next pre-training run.

Robots.txt patterns we apply

We follow the RFC 9309 longest-match rule. If two directives match the path, the one with the longer pattern wins. Allow: /blog/ beats Disallow: /; Disallow: /blog/draft/ beats Allow: /blog/. Wildcards (*) and end-of-string anchors ($) are honoured. If no group matches the user-agent, we fall back to the * group, then to default-allow.

A few subtleties worth knowing. An empty Disallow: directive (no path) is interpreted as "allow everything". Disallow: /search? matches /search followed by a query string (such as /search?q=shoes) but not /search/results. A $ at the end of a pattern anchors the match to the end of the URL. Crawl-delay and sitemap directives are honoured by some bots and ignored by others.
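
If you want to see the mechanics, here is a minimal Python sketch of that longest-match evaluation. It is illustrative only, not our production matcher, and it assumes the directives for one user-agent group have already been parsed into (directive, pattern) pairs.

import re

def pattern_to_regex(pattern: str) -> str:
    # Translate a robots.txt pattern ('*' wildcard, '$' end anchor)
    # into an anchored regular expression.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return "^" + regex + ("$" if anchored else "")

def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    # rules is a list like [("disallow", "/"), ("allow", "/blog/")].
    # The longest matching pattern wins; on a tie, allow wins (RFC 9309);
    # no match at all means default-allow.
    best_len, verdict = -1, True
    for directive, pattern in rules:
        if not pattern:  # an empty Disallow: allows everything
            continue
        if re.match(pattern_to_regex(pattern), path):
            n = len(pattern)
            if n > best_len or (n == best_len and directive == "allow"):
                best_len, verdict = n, (directive == "allow")
    return verdict

# Allow: /blog/ beats Disallow: /; Disallow: /blog/draft/ beats Allow: /blog/
rules = [("disallow", "/"), ("allow", "/blog/"), ("disallow", "/blog/draft/")]
print(is_allowed(rules, "/blog/post-1"))        # True
print(is_allowed(rules, "/blog/draft/post-1"))  # False
print(is_allowed(rules, "/pricing"))            # False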

Beyond robots.txt: edge-level blocks

Robots.txt is the polite layer. The forceful layer lives at the edge - Cloudflare, AWS WAF, Fastly, Akamai. Several products now bundle one-click "block AI bots" toggles, and many security plugins (Wordfence, Sucuri) ship aggressive defaults. None of those rules are visible from a robots.txt fetch.

If this checker says all crawlers are allowed but your ChatGPT mention checker shows zero mentions, suspect the edge. Specifically: check Cloudflare's "Block AI Bots" setting, Bot Fight Mode, and any custom firewall rule that filters by user-agent. Also check whether your origin is returning a 403 for the test bot - a JA3 fingerprint mismatch is a common culprit.

A 30-second edge-block diagnostic

  1. Open a terminal and run: curl -A "GPTBot/1.0" -I https://yourdomain.com/your-page
  2. Look at the status. 200 = your origin and edge let GPTBot through. 403 or 429 = blocked at the edge.
  3. Repeat with ClaudeBot, PerplexityBot, ChatGPT-User.
  4. If only some return 403, you have a per-bot edge rule. If all return 403, you have a category-level "block AI" toggle. A scripted version of this loop follows below.
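
To script those four curl calls, here is a minimal Python sketch using only the standard library. The URL is a placeholder, and note that some edges match on more than the User-Agent header (IP ranges, TLS fingerprints), so a 200 here is necessary but not sufficient proof of access.

import urllib.request
import urllib.error

URL = "https://yourdomain.com/your-page"  # replace with a real page
BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "ChatGPT-User"]

for bot in BOTS:
    req = urllib.request.Request(
        URL, method="HEAD", headers={"User-Agent": f"{bot}/1.0"})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{bot}: {resp.status}")      # 200 = allowed through
    except urllib.error.HTTPError as e:
        print(f"{bot}: {e.code}")               # 403 / 429 = edge block
    except urllib.error.URLError as e:
        print(f"{bot}: connection error ({e.reason})")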

Common mistakes

  • Blocking * with Disallow: / and forgetting to add Allow: rules. The wildcard rule applies to bots that have no group - many AI bots fall through to it.
  • Listing the wrong user-agent. OpenAI-GPTBot is not the right name. The correct user-agent is GPTBot (case-insensitive).
  • Trusting the "Disallow: /admin" line in a wildcard group. AI bots may match a more specific group with no Disallow on /admin and slip through. Always check.
  • Adding Crawl-delay to slow GPTBot. GPTBot does not respect Crawl-delay. Use rate limits at the edge instead.
  • Forgetting that robots.txt is per-host. www.example.com/robots.txt and example.com/robots.txt are separate files. Both must be open - the sketch below checks both variants in one go.
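
On the per-host point, here is a minimal sketch that fetches robots.txt from both host variants and flags any difference. example.com stands in for your domain; if one host simply redirects to the other, the two fetches will come back identical.

import urllib.request

def fetch_robots(host: str) -> str:
    # Fetch the robots.txt served by one specific host.
    url = f"https://{host}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

apex = fetch_robots("example.com")
www = fetch_robots("www.example.com")
print("identical" if apex == www
      else "WARNING: the two hosts serve different robots.txt files")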

How often to run this check

  • Once now - establish a baseline for your most important landing pages.
  • After every robots.txt change - even a one-character edit.
  • After every CDN / WAF rule deploy - those are the silent regressors.
  • Monthly - on a calendar reminder, against your top 10 landing pages (or script it; see the sketch after this list).
  • Whenever an AI mention rate drops - the first thing to rule out is a crawl block.
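
For the calendar-reminder crowd, Python's standard-library robotparser gives a rough scripted baseline. Bear in mind it predates RFC 9309 and handles * and $ patterns incompletely, so treat it as a smoke test and confirm anything surprising with this checker. The host, pages and bot list below are placeholders.

from urllib.robotparser import RobotFileParser

SITE = "https://yourdomain.com"            # replace with your host
PAGES = ["/", "/pricing", "/blog/"]        # your top landing pages
BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "ChatGPT-User",
        "Google-Extended", "CCBot"]

rp = RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()                                  # fetches and parses robots.txt

for page in PAGES:
    for bot in BOTS:
        ok = rp.can_fetch(bot, SITE + page)
        print(f"{bot:16} {page:10} {'allowed' if ok else 'BLOCKED'}")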

Frequently asked questions

Should I block GPTBot or allow it?
Allow it unless you have a strong content-licensing strategy. Blocking GPTBot removes your content from ChatGPT’s next training run AND from many citation pathways. The trade-off is rarely worth it for businesses that benefit from being recommended.
Why does my page show as blocked when robots.txt says Allow?
Check for a longer disallow pattern that matches the same URL. Robots rules use longest-match, so Disallow: /blog/draft/ beats Allow: /blog/ for /blog/draft/post-1. The "Rule applied" column tells you exactly which directive won.
My site has no robots.txt. Am I safe?
You are, by default: the absence of a robots.txt means everything is allowed. We surface that explicitly in the result so you know the answer is "yes, all crawlers can access this URL".
Does this tool fetch the URL itself?
No. We only fetch /robots.txt on the same host and apply the rules. We never request the user-supplied URL. If your robots.txt is unreachable, we say so.
What about Cloudflare or WAF blocks?
This tool checks robots.txt only. Edge-level blocks (Cloudflare’s "Block AI Bots" toggle, IP allowlists, WAF rules) sit above robots and are not visible from a robots fetch. If your robots.txt looks open but AI bots still cannot reach the page, suspect the edge.
How often should I check?
Whenever you change robots.txt, deploy a new edge rule, or migrate hosts. We also recommend a monthly check on every important landing page, and an immediate check any time AI mention rates drop unexpectedly.
Does Disallow: / really block ChatGPT?
For the wildcard group, it blocks any bot without its own group or an explicit allow rule. GPTBot honours its own group if you declare one; if you have not, it falls through to the wildcard. The result is yes - your site is blocked from training. Live-fetch (ChatGPT-User) follows the same fall-through unless you explicitly allow it.
What is the difference between Google-Extended and Googlebot?
Googlebot crawls for Google search. Google-Extended is a separate user-agent for Gemini training and Vertex AI. Blocking Google-Extended does NOT affect your Google search rankings. It only opts you out of generative AI training.
My robots.txt looks fine but the page is still 403. Why?
Edge-level blocking. Cloudflare's Bot Fight Mode, AWS WAF rules, and many security plugins reject AI user-agents before robots.txt is even evaluated. Run the curl diagnostic above to see status codes per bot.
Should I add a meta noai tag to pages I want excluded?
It does no harm, but adoption is patchy. The robots.txt directive is the universally understood signal. Use the noai/noimageai meta tags as a belt-and-braces additional layer if you want.

Related free tools

  • llms.txt Generator - build a curated AI reading list for your site.
  • GEO Score Checker - score any page on its AI-readiness in seconds.
  • ChatGPT Mention Checker - see if ChatGPT mentions your brand for any question.

Want continuous tracking instead of a one-off check? Livesov monitors your AI visibility across ChatGPT, Perplexity, Claude, Gemini and Grok every day.

