robots.txt is the most powerful 200 bytes on your website. Get it right, and search engines crawl exactly the pages you want them to. Get it wrong — one stray slash, one misordered directive — and you can accidentally deindex your entire site, block important pages from Google, or leave private directories wide open.
This guide explains how robots.txt actually works in 2026, the specific rules every site should have, the mistakes that nuke SEO overnight, and how to build a correct file in two minutes.
What robots.txt is (and isn't)
A robots.txt file lives at your site's root (https://yoursite.com/robots.txt) and tells web crawlers — Googlebot, Bingbot, AI scrapers, and others — which paths they can access and which they can't. Crawlers check this file before crawling your site.
It is:
- A polite request, not a security boundary
- The first thing Google reads on every crawl
- Specific to each user-agent (different rules for different crawlers)
- The right place to point to your sitemap
It is not:
- A way to hide private content. Anyone can read your robots.txt, and the URLs you list as "Disallow" are now public knowledge. Real security comes from authentication, not robots.txt.
- A way to remove a page from Google. Disallow blocks crawling, but if other sites link to the URL, Google may still list it in results with no description. Use noindex meta tags for true removal.
- Honored by every crawler. Reputable search engines obey it; malicious scrapers ignore it entirely.
The anatomy of robots.txt
A robots.txt file is made of one or more groups, each starting with a User-agent: line. Within a group you place Allow: and Disallow: directives; Sitemap: declarations sit outside any group.
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /private/public-resource.pdf
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
Reading top to bottom: Googlebot can't crawl /admin/ or /private/, except it can crawl /private/public-resource.pdf. All other crawlers can't crawl /cart/ or /checkout/. Every crawler can find the sitemap at the URL given.
The rules every site should have
These baseline rules apply to almost every site and won't break anything:
User-agent: *
Disallow: /wp-admin/ # WordPress only — admin area
Disallow: /admin/ # generic admin area
Disallow: /cart/ # e-commerce only — shopping cart
Disallow: /checkout/ # e-commerce only — checkout
Disallow: /search/ # internal search results
Disallow: /*?s= # WordPress search URLs
Disallow: /*?orderby= # e-commerce sort URLs (duplicate content)
Sitemap: https://yoursite.com/sitemap.xml
Adjust the paths to match your CMS. The principle: disallow anything that has no SEO value or generates infinite URL variations (search results, sort/filter pages, user accounts, carts).
Wildcards and operators that actually work
The robots.txt spec is small but has a few useful features:
- * — matches any sequence of characters. /private/* matches /private/anything/here.
- $ — matches the end of a URL. /*.pdf$ matches only URLs ending in .pdf.
- # — starts a comment. Everything after # on a line is ignored.
Examples:
Disallow: /*.pdf$ # block crawling of all PDFs
Disallow: /tag/* # block all tag archive pages
Disallow: /*? # block all URLs containing a question mark
Allow: /tag/featured # exception: allow this one tag archive
Order matters for Allow/Disallow conflicts. Google's rule: the most specific match wins (longest matching path), regardless of order in the file. Some other crawlers use file order, but Googlebot uses specificity.
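For example, take this hypothetical pair of rules:
Disallow: /downloads/
Allow: /downloads/whitepaper.pdf
For the URL /downloads/whitepaper.pdf, Googlebot applies the Allow rule because its path (25 characters) is a longer match than /downloads/ (11 characters); every other URL under /downloads/ stays blocked.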
The five mistakes that kill SEO
1. Disallow: / for all user agents
User-agent: *
Disallow: /
This blocks every crawler from every URL on your site. If you accidentally push this to production, crawling stops immediately and your Google visibility can collapse within days.
It's depressingly common — usually because a developer set it on a staging server and the file got copied over to production.
2. Blocking CSS and JavaScript
Old advice said to block /css/ and /js/ to save crawl budget. Don't. Google needs to render your pages to understand them. If CSS and JS are blocked, your pages may look broken to Google and lose rankings.
Specifically wrong:
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /assets/
If your site uses these paths and you block them, Google can't see your styling or interactivity and may judge the site harshly. Leave them allowed.
3. Blocking pages you want indexed
If you Disallow: a page in robots.txt, Google can't read it — and therefore can't read its noindex meta tag, canonical tag, or content. The page may still appear in search results as a URL-only listing if other sites link to it.
For pages you want to keep out of Google, use noindex in the page's <head>. Don't use robots.txt for that.
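For reference, the standard noindex signals look like this; the meta tag goes in the page's HTML, and the X-Robots-Tag response header is the equivalent for non-HTML files such as PDFs:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
Google only sees either one if the URL stays crawlable, so don't pair noindex with a robots.txt Disallow for the same page.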
4. Inconsistent crawler groups
User-agent: Googlebot
Disallow: /
User-agent: *
Allow: /
Googlebot reads its own group (which says block everything) and ignores the wildcard group. Result: blocked from Google. The * group only applies to crawlers that don't have their own group.
If you want Googlebot to follow the same rules as everyone else, don't give it a separate group at all.
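If you genuinely need Googlebot-specific rules, repeat the shared directives inside its group, since Googlebot reads only that group. A hypothetical example, with /beta-preview/ standing in for the one path that only Googlebot should skip:
User-agent: Googlebot
Disallow: /cart/
Disallow: /checkout/
Disallow: /beta-preview/
User-agent: *
Disallow: /cart/
Disallow: /checkout/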
5. Missing sitemap reference
User-agent: *
Disallow: /admin/
No Sitemap: line. Google might still find your sitemap if you've submitted it in Search Console, but Bing, DuckDuckGo, and other engines may miss it entirely. Always include the line.
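The fix is one extra line (the URL here is a placeholder):
User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: takes a full absolute URL, and you can list several Sitemap: lines if your site has more than one sitemap.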
AI crawler rules (the 2026 question)
A new category of crawlers fetches your content to train AI models or feed AI search engines: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI-training opt-out token, separate from Googlebot), PerplexityBot, CCBot (Common Crawl), and others.
Some sites want to block these; others want their content represented in AI answers. There's no universal right choice. To block all AI training crawlers while still allowing search:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
Important: blocking Google-Extended does not block Googlebot. Google-Extended is a separate robots.txt token that only controls whether your content is used for Google's AI training; Googlebot's search crawling is unaffected. You can block AI training while keeping search.
If you want your content to be discoverable in AI search but not used for training, the situation is murkier — most crawlers don't yet distinguish "training" from "answering," and policies are evolving. Many publishers in 2026 have decided to allow ClaudeBot and PerplexityBot (which produce direct citations and traffic) while blocking GPTBot and CCBot (which produce neither). Your call.
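As one sketch of that split, block the training-focused crawlers explicitly and leave the others out of the file entirely:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
ClaudeBot and PerplexityBot get no group of their own here on purpose: a crawler with its own group ignores the wildcard group, so omitting them keeps your normal User-agent: * rules applying to them.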
How to generate a correct robots.txt
Using our free robots.txt generator:
1. Pick your CMS or platform
WordPress, Shopify, Webflow, Squarespace, Wix, custom. The tool pre-fills the standard blocks for that platform — /wp-admin/ for WordPress, /admin/ and /cart/ for Shopify, etc.
2. Add custom disallow paths
Type any additional paths to block — staging directories, internal tools, file types you don't want indexed.
3. Set AI crawler preferences
Toggle whether to allow or block each major AI crawler. Defaults are sensible; flip whatever you disagree with.
4. Add your sitemap URL
The tool inserts the correct Sitemap: line at the bottom. If you don't have a sitemap yet, build one with our sitemap generator first.
5. Download and upload
The result is a plain text file. Upload it to your site's root so it appears at https://yoursite.com/robots.txt. That's the only valid location — robots.txt in a subdirectory does nothing.
6. Test it
Use the robots.txt report in Google Search Console (under Settings) to verify the file is being fetched and parsed correctly, then run a few URLs through the URL Inspection tool and confirm each one's crawl-allowed status matches what you intended.
Common questions
How big can robots.txt be?
Google parses up to 500 KiB. Beyond that it stops reading. In practice, a well-written robots.txt is under 5 KB. If yours is bigger, you're probably doing something wrong (listing individual URLs instead of patterns).
How often do crawlers recheck robots.txt?
Google generally caches robots.txt for up to 24 hours, and may rely on the cached copy longer if it temporarily can't fetch a fresh one. Bing and others are similar. Changes generally take effect within a day.
What happens if robots.txt is missing or returns a 404?
Crawlers assume "everything allowed." Not catastrophic, but you lose the ability to direct crawl behavior and point to your sitemap.
What if robots.txt returns a 500 error?
Most crawlers (including Googlebot) treat a 5xx as "the file is temporarily unavailable, don't crawl anything until it's back." Sustained 5xx errors on robots.txt can dramatically reduce your crawl rate. Monitor it.
Can I block a specific URL parameter?
Yes:
Disallow: /*?sort=
Disallow: /*?utm_source=
This stops crawlers from wasting crawl budget on every parameter variant of your URLs. Especially valuable on e-commerce sites.
Should I block search engines from my staging site?
Yes — but use HTTP authentication or IP allowlisting too, not just robots.txt. Anyone can guess staging.yourdomain.com and read its robots.txt.
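If you do add a robots.txt to the staging site as a second line of defense, the whole file is just the block-everything rule from mistake #1, which is exactly why it must never be copied to production:
User-agent: *
Disallow: /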
The bottom line
robots.txt is a one-time setup that prevents Google from crawling junk URLs, blocks AI scrapers you don't want, and points to your sitemap. Build it once, test it in Search Console, and revisit it whenever your site structure changes. Don't disallow CSS, don't accidentally Disallow: /, and don't use it as a privacy tool. That's the whole game.
Generate a robots.txt free →