robots.txt is the most powerful 200 bytes on your website. Get it right, and search engines crawl exactly the pages you want them to. Get it wrong — one stray slash, one misordered directive — and you can accidentally deindex your entire site, block important pages from Google, or leave private directories wide open.
This guide explains how robots.txt actually works in 2026, the specific rules every site should have, the mistakes that nuke SEO overnight, and how to build a correct file in two minutes.
What robots.txt is (and isn't)
A robots.txt file lives at your site's root (https://yoursite.com/robots.txt) and tells web crawlers — Googlebot, Bingbot, AI scrapers, and others — which paths they can access and which they can't. Crawlers check this file before crawling your site.
It is:
- A polite request, not a security boundary
- The first thing Google reads on every crawl
- Specific to each user-agent (different rules for different crawlers)
- The right place to point to your sitemap
It is not:
- A way to hide private content. Anyone can read your robots.txt, and the URLs you list as "Disallow" are now public knowledge. Real security comes from authentication, not robots.txt.
- A way to remove a page from Google. Disallow blocks crawling, but if other sites link to the URL, Google may still list it in results with no description. Use noindex meta tags for true removal.
- Honored by every crawler. Reputable search engines obey it; malicious scrapers ignore it entirely.
The anatomy of robots.txt
A robots.txt file is made of one or more groups, each starting with a User-agent: line. Within a group you place Allow: and Disallow: directives; Sitemap: declarations sit outside any group.
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /private/public-resource.pdf
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
Reading top to bottom: Googlebot can't crawl /admin/ or /private/, except it can crawl /private/public-resource.pdf. All other crawlers can't crawl /cart/ or /checkout/. Every crawler can find the sitemap at the URL given.
The rules every site should have
These baseline rules apply to almost every site and won't break anything:
User-agent: *
Disallow: /wp-admin/ # WordPress only — admin area
Disallow: /admin/ # generic admin area
Disallow: /cart/ # e-commerce only — shopping cart
Disallow: /checkout/ # e-commerce only — checkout
Disallow: /search/ # internal search results
Disallow: /*?s= # WordPress search URLs
Disallow: /*?orderby= # e-commerce sort URLs (duplicate content)
Sitemap: https://yoursite.com/sitemap.xml
Adjust the paths to match your CMS. The principle: disallow anything that has no SEO value or generates infinite URL variations (search results, sort/filter pages, user accounts, carts).
Wildcards and operators that actually work
The robots.txt spec is small but has a few useful features:
- * — matches any sequence of characters. /private/* matches /private/anything/here.
- $ — matches the end of a URL. /*.pdf$ matches only URLs ending in .pdf.
- # — starts a comment. Everything after # on a line is ignored.
Examples:
Disallow: /*.pdf$ # block crawling of all PDFs
Disallow: /tag/* # block all tag archive pages
Disallow: /*? # block all URLs containing a question mark
Allow: /tag/featured # exception: allow this one tag archive
Order matters for Allow/Disallow conflicts. Google's rule: the most specific match wins (longest matching path), regardless of order in the file. Some other crawlers use file order, but Googlebot uses specificity.
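For example, take this hypothetical pair of rules:
Disallow: /downloads/
Allow: /downloads/whitepaper.pdf
For the URL /downloads/whitepaper.pdf, Googlebot applies the Allow rule because its path (25 characters) is a longer match than /downloads/ (11 characters); every other URL under /downloads/ stays blocked.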
The five mistakes that kill SEO
1. Disallow: / for all user agents
User-agent: *
Disallow: /
This blocks every crawler from every URL on your site. If you accidentally push this to production, crawling stops immediately and your Google visibility can collapse within days.
It's depressingly common — usually because a developer set it on a staging server and the file got copied over to production.
2. Blocking CSS and JavaScript
Old advice said to block /css/ and /js/ to save crawl budget. Don't. Google needs to render your pages to understand them. If CSS and JS are blocked, your pages may look broken to Google and lose rankings.
Specifically wrong:
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /assets/
If your site uses these paths and you block them, Google can't see your styling or interactivity and may judge the site harshly. Leave them allowed.
3. Blocking pages you want indexed
If you Disallow: a page in robots.txt, Google can't read it — and therefore can't read its noindex meta tag, canonical tag, or content. The page may still appear in search results as a URL-only listing if other sites link to it.
For pages you want to keep out of Google, use noindex in the page's <head>. Don't use robots.txt for that.
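For reference, the standard noindex signals look like this; the meta tag goes in the page's HTML, and the X-Robots-Tag response header is the equivalent for non-HTML files such as PDFs:
<meta name="robots" content="noindex">
X-Robots-Tag: noindex
Google only sees either one if the URL stays crawlable, so don't pair noindex with a robots.txt Disallow for the same page.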
4. Inconsistent crawler groups
User-agent: Googlebot
Disallow: /
User-agent: *
Allow: /
Googlebot reads its own group (which says block everything) and ignores the wildcard group. Result: blocked from Google. The * group only applies to crawlers that don't have their own group.
If you want Googlebot to follow the same rules as everyone else, don't give it a separate group at all.
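If you genuinely need Googlebot-specific rules, repeat the shared directives inside its group, since Googlebot reads only that group. A hypothetical example, with /beta-preview/ standing in for the one path that only Googlebot should skip:
User-agent: Googlebot
Disallow: /cart/
Disallow: /checkout/
Disallow: /beta-preview/
User-agent: *
Disallow: /cart/
Disallow: /checkout/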
5. Missing sitemap reference
User-agent: *
Disallow: /admin/
No Sitemap: line. Google might still find your sitemap if you've submitted it in Search Console, but Bing, DuckDuckGo, and other engines may miss it entirely. Always include the line.
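The fix is one extra line (the URL here is a placeholder):
User-agent: *
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml
Sitemap: takes a full absolute URL, and you can list several Sitemap: lines if your site has more than one sitemap.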
AI crawler rules (the 2026 question)
A new category of crawlers fetches your content to train AI models or feed AI search engines: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's AI-training opt-out token, separate from Googlebot), PerplexityBot, CCBot (Common Crawl), and others.
Some sites want to block these; others want their content represented in AI answers. There's no universal right choice. To block all AI training crawlers while still allowing search:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
Important: blocking Google-Extended does not block Googlebot. Google-Extended is a separate robots.txt token that only controls whether your content is used for Google's AI training; Googlebot's search crawling is unaffected. You can block AI training while keeping search.
If you want your content to be discoverable in AI search but not used for training, the situation is murkier — most crawlers don't yet distinguish "training" from "answering," and policies are evolving. Many publishers in 2026 have decided to allow ClaudeBot and PerplexityBot (which produce direct citations and traffic) while blocking GPTBot and CCBot (which produce neither). Your call.
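As one sketch of that split, block the training-focused crawlers explicitly and leave the others out of the file entirely:
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
ClaudeBot and PerplexityBot get no group of their own here on purpose: a crawler with its own group ignores the wildcard group, so omitting them keeps your normal User-agent: * rules applying to them.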
How to generate a correct robots.txt
Using our free robots.txt generator:
1. Pick your CMS or platform
WordPress, Shopify, Webflow, Squarespace, Wix, custom. The tool pre-fills the standard blocks for that platform — /wp-admin/ for WordPress, /admin/ and /cart/ for Shopify, etc.
2. Add custom disallow paths
Type any additional paths to block — staging directories, internal tools, file types you don't want indexed.
3. Set AI crawler preferences
Toggle whether to allow or block each major AI crawler. Defaults are sensible; flip whatever you disagree with.
4. Add your sitemap URL
The tool inserts the correct Sitemap: line at the bottom. If you don't have a sitemap yet, build one with our sitemap generator first.
5. Download and upload
The result is a plain text file. Upload it to your site's root so it appears at https://yoursite.com/robots.txt. That's the only valid location — robots.txt in a subdirectory does nothing.
6. Test it
Use the robots.txt report in Google Search Console (under Settings) to verify the file is being fetched and parsed correctly, then run a few URLs through the URL Inspection tool and confirm each one's crawl-allowed status matches what you intended.
Common questions
How big can robots.txt be?
Google parses up to 500 KiB. Beyond that it stops reading. In practice, a well-written robots.txt is under 5 KB. If yours is bigger, you're probably doing something wrong (listing individual URLs instead of patterns).
How often do crawlers recheck robots.txt?
Google generally caches robots.txt for up to 24 hours, and may rely on the cached copy longer if it temporarily can't fetch a fresh one. Bing and others are similar. Changes generally take effect within a day.
What happens if robots.txt is missing or returns a 404?
Crawlers assume "everything allowed." Not catastrophic, but you lose the ability to direct crawl behavior and point to your sitemap.
What if robots.txt returns a 500 error?
Most crawlers (including Googlebot) treat a 5xx as "the file is temporarily unavailable, don't crawl anything until it's back." Sustained 5xx errors on robots.txt can dramatically reduce your crawl rate. Monitor it.
Can I block a specific URL parameter?
Yes:
Disallow: /*?sort=
Disallow: /*?utm_source=
This stops crawlers from wasting crawl budget on every parameter variant of your URLs. Especially valuable on e-commerce sites.
Should I block search engines from my staging site?
Yes — but use HTTP authentication or IP allowlisting too, not just robots.txt. Anyone can guess staging.yourdomain.com and read its robots.txt.
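If you do add a robots.txt to the staging site as a second line of defense, the whole file is just the block-everything rule from mistake #1, which is exactly why it must never be copied to production:
User-agent: *
Disallow: /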
The bottom line
robots.txt is a one-time setup that prevents Google from crawling junk URLs, blocks AI scrapers you don't want, and points to your sitemap. Build it once, test it in Search Console, and revisit it whenever your site structure changes. Don't disallow CSS, don't accidentally Disallow: /, and don't use it as a privacy tool. That's the whole game.
Generate a robots.txt free →