The 200 most important bytes on your site. Get robots.txt right and crawlers behave; get it wrong and your site can vanish from Google. Here's what every directive does, including the new AI crawler rules.
robots.txt is the most powerful 200 bytes on your website. Get it right, and search engines crawl exactly the pages you want them to. Get it wrong (one stray slash, one misordered directive) and you can accidentally deindex your entire site, block important pages from Google, or leave private directories wide open.
This guide explains how robots.txt actually works in 2026, the specific rules every site should have, the mistakes that nuke SEO overnight, and how to build a correct file in two minutes.
A robots.txt file lives at your site's root (https://yoursite.com/robots.txt) and tells web crawlers (Googlebot, Bingbot, AI scrapers, and others) which paths they can access and which they can't. Crawlers check this file before crawling your site.
It is:
It is not:
noindex meta tags for true removal.
A file is made of one or more groups, each starting with a User-agent: line. Within a group, Allow: and Disallow: directives set the rules. Sitemap: declarations sit outside any group.
User-agent: Googlebot
Disallow: /admin/
Disallow: /private/
Allow: /private/public-resource.pdf
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Sitemap: https://example.com/sitemap.xml
Reading top to bottom: Googlebot can't crawl /admin/ or /private/, except it can crawl /private/public-resource.pdf. All other crawlers can't crawl /cart/ or /checkout/. Both can find the sitemap at the URL given.
These baseline rules suit almost every site and won't break anything; drop the lines that don't apply to your platform:
User-agent: *
Disallow: /wp-admin/ # WordPress only, admin area
Disallow: /admin/ # generic admin area
Disallow: /cart/ # e-commerce only, shopping cart
Disallow: /checkout/ # e-commerce only, checkout
Disallow: /search/ # internal search results
Disallow: /*?s= # WordPress search URLs
Disallow: /*?orderby= # e-commerce sort URLs (duplicate content)
Sitemap: https://yoursite.com/sitemap.xml
Adjust the paths to match your CMS. The principle: disallow anything that has no SEO value or generates infinite URL variations (search results, sort/filter pages, user accounts, carts).
The robots.txt spec is small but has a few useful features:
* (asterisk) matches any sequence of characters. /private/* matches /private/anything/here.
$ (dollar sign) matches the end of a URL. /*.pdf$ matches only URLs ending in .pdf.
# (hash) starts a comment. Everything after # on a line is ignored.
Examples:
Disallow: /*.pdf$ # block crawling of all PDFs
Disallow: /tag/* # block all tag archive pages
Disallow: /*? # block all URLs containing a question mark
Allow: /tag/featured # exception: allow this one tag archive
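For Googlebot, file order does not decide Allow/Disallow conflicts. Google's rule: the most specific match wins (the longest matching path), regardless of where the rules sit in the file. If an Allow and a Disallow match with equal length, the less restrictive Allow wins. Some other crawlers use file order instead, but Googlebot uses specificity.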
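The longest-match rule can be sketched in a few lines of Python. This is a minimal illustration, not Google's actual parser: the names pattern_to_regex and is_allowed are hypothetical, rules are passed in as already-parsed (directive, pattern) pairs, and pattern length in characters stands in for specificity.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex anchored at the start."""
    anchored_end = pattern.endswith("$")
    if anchored_end:
        pattern = pattern[:-1]
    # Escape literal text; each '*' becomes '.*' (any sequence of characters).
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored_end else ""))


def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Longest matching pattern wins; on a tie, Allow beats Disallow."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


rules = [
    ("disallow", "/*.pdf$"),
    ("disallow", "/tag/*"),
    ("allow", "/tag/featured"),
]
print(is_allowed(rules, "/tag/featured"))      # True: the Allow is longer
print(is_allowed(rules, "/tag/news"))          # False: only /tag/* matches
print(is_allowed(rules, "/files/report.pdf"))  # False: ends in .pdf
print(is_allowed(rules, "/blog/"))             # True: nothing matches
```

Swapping the order of the rules list leaves every result unchanged, which is the point of matching by specificity.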
Disallow: / for all user agents
User-agent: *
Disallow: /
This blocks every crawler from every URL on your site. If you accidentally push this to production, your pages can start disappearing from Google within days.
It's depressingly common, usually because a developer set it on a staging server and the file got copied over to production.
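One way to catch this before it ships is a check in the deploy pipeline. The sketch below is an assumption about how such a guard might look (the function name blocks_whole_site is hypothetical). It only flags a bare Disallow: / inside the wildcard group, so deliberate blocks aimed at a single crawler are left alone.

```python
def blocks_whole_site(robots_txt: str) -> bool:
    """Return True if the 'User-agent: *' group contains a bare 'Disallow: /'."""
    agents: list[str] = []
    in_agent_run = False  # True while reading consecutive User-agent lines
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agent_run:
                agents = []  # a new group starts here
            agents.append(value)
            in_agent_run = True
        else:
            in_agent_run = False
            if field == "disallow" and value == "/" and "*" in agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(blocks_whole_site(staging))     # True
print(blocks_whole_site(production))  # False
```

A build step could call this on the file about to be deployed and fail the release when it returns True.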
Old advice said to block /css/ and /js/ to save crawl budget. Don't. Google needs to render your pages to understand them. If CSS and JS are blocked, your pages may look broken to Google and lose rankings.
Specifically wrong:
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /assets/
If your site uses these paths and you block them, Google can't see your styling or interactivity and may judge the site harshly. Leave them allowed.
If you Disallow: a page in robots.txt, Google can't read it, and therefore can't read its noindex meta tag, canonical tag, or content. The page may still appear in search results as a URL-only listing if other sites link to it.
For pages you want to keep out of Google, use noindex in the page's <head>. Don't use robots.txt for that.
User-agent: Googlebot
Disallow: /
User-agent: *
Allow: /
Googlebot reads its own group (which says block everything) and ignores the wildcard group. Result: blocked from Google. The * group only applies to crawlers that don't have their own group.
If you want Googlebot to follow the same rules as everyone else, don't give it a separate group at all.
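The group-selection behavior can be sketched as follows. This is a simplified model under stated assumptions: parse_groups and rules_for are hypothetical names, and the crawler name is matched exactly (case-insensitively) against the User-agent value, whereas real crawlers apply their own token-matching rules.

```python
def parse_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Map each lowercased User-agent value to its (directive, path) rules."""
    groups: dict[str, list[tuple[str, str]]] = {}
    current_agents: list[str] = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                current_agents = []  # a new group starts here
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
    return groups


def rules_for(groups: dict[str, list[tuple[str, str]]], crawler: str):
    """A crawler with its own group uses only that group; others fall back to '*'."""
    return groups.get(crawler.lower(), groups.get("*", []))


text = "User-agent: Googlebot\nDisallow: /\n\nUser-agent: *\nAllow: /\n"
groups = parse_groups(text)
print(rules_for(groups, "Googlebot"))  # [('disallow', '/')]
print(rules_for(groups, "Bingbot"))    # [('allow', '/')]
```

Note that the lookup never merges the two groups: once a named group exists, the wildcard rules are invisible to that crawler.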
User-agent: *
Disallow: /admin/
No Sitemap: line. Google might still find your sitemap if you've submitted it in Search Console, but Bing, DuckDuckGo, and other engines may miss it entirely. Always include the line.
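A quick way to confirm the line is present is Python's standard-library parser, which exposes declared sitemaps through site_maps() (available from Python 3.8). The wrapper name declared_sitemaps is hypothetical, and the sketch works on text you already have in hand rather than fetching anything.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt: str) -> list[str]:
    """Return every Sitemap: URL declared in the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when the file declares no sitemap.
    return parser.site_maps() or []


missing = "User-agent: *\nDisallow: /admin/\n"
present = missing + "Sitemap: https://example.com/sitemap.xml\n"
print(declared_sitemaps(missing))  # []
print(declared_sitemaps(present))  # ['https://example.com/sitemap.xml']
```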
A new category of crawlers fetches your content to train AI models or feed AI search engines: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google's robots.txt token for controlling AI training use, separate from Googlebot), PerplexityBot, CCBot (Common Crawl), and others.
Some sites want to block these; others want their content represented in AI answers. There's no universal right choice. To block all AI training crawlers while still allowing search:
User-agent: GPTBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: CCBot
Disallow: /
Important: blocking Google-Extended does not block Googlebot. The two are separate robots.txt tokens, one for search and one for AI training, so you can block AI training while keeping search.
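If the list of blocked crawlers changes often, the groups can be generated rather than hand-edited. A minimal sketch, assuming you maintain the token list yourself; AI_TRAINING_TOKENS and block_groups are hypothetical names, and the tokens are the ones named in this article.

```python
AI_TRAINING_TOKENS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot", "CCBot"]


def block_groups(tokens: list[str]) -> str:
    """Build one 'Disallow: /' group per crawler token, separated by blank lines."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    return "\n\n".join(groups) + "\n"


output = block_groups(AI_TRAINING_TOKENS)
print(output)
```

Because Googlebot is not in the list, no group is emitted for it and search crawling is unaffected.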
If you want your content to be discoverable in AI search but not used for training, the situation is murkier: most crawlers don't yet distinguish "training" from "answering," and policies are evolving. Many publishers in 2026 have decided to allow ClaudeBot and PerplexityBot (which produce direct citations and traffic) while blocking GPTBot and CCBot (which produce neither). Your call.
Using our free robots.txt generator:
Choose your platform: WordPress, Shopify, Webflow, Squarespace, Wix, or custom. The tool pre-fills the standard blocks for that platform: /wp-admin/ for WordPress, /admin/ and /cart/ for Shopify, and so on.
Type any additional paths to block: staging directories, internal tools, file types you don't want crawled.
Toggle whether to allow or block each major AI crawler. Defaults are sensible; flip whatever you disagree with.
The tool inserts the correct Sitemap: line at the bottom. If you don't have a sitemap yet, build one with our sitemap generator first.
The result is a plain text file. Upload it to your site's root so it appears at https://yoursite.com/robots.txt. That's the only valid location; a robots.txt in a subdirectory does nothing.
Open the robots.txt report in Google Search Console (it replaced the older robots.txt Tester) to verify the file is being fetched and parsed without errors and isn't blocking anything important. Then check a few URLs with the URL Inspection tool and confirm each one's allow/disallow status matches what you intended.
Google parses up to 500 KB. Beyond that it stops reading. In practice, a well-written robots.txt is under 5 KB. If yours is bigger, you're probably doing something wrong (listing individual URLs instead of patterns).
Google generally caches robots.txt for up to 24 hours, and may keep using the cached copy longer if it can't fetch a fresh one. Bing and others are similar. Changes generally take effect within a day.
Crawlers assume "everything allowed." Not catastrophic, but you lose the ability to direct crawl behavior and point to your sitemap.
Most crawlers (including Googlebot) treat a 5xx as "the file is temporarily unavailable, don't crawl anything until it's back." Sustained 5xx errors on robots.txt can dramatically reduce your crawl rate. Monitor it.
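The two failure cases above can be summarized as a small decision function. This is a sketch of the behavior as described here, not any crawler's real code: crawl_policy and its return labels are hypothetical, and status codes the text doesn't cover are deliberately left unspecified.

```python
def crawl_policy(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to the crawler's likely reaction."""
    if 200 <= status < 300:
        return "parse_rules"     # file fetched: apply its directives
    if status == 404:
        return "allow_all"       # no file: everything is treated as allowed
    if 500 <= status < 600:
        return "pause_crawling"  # server error: hold off until the file is back
    return "unspecified"         # other codes are outside the scope of this sketch


for code in (200, 404, 503):
    print(code, crawl_policy(code))
```

For monitoring purposes the 5xx branch is the one to alert on, since a missing file is harmless by comparison.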
Yes:
Disallow: /*?sort=
Disallow: /*?utm_source=
This stops crawlers from crawling every parameter variant of your URLs. It's especially valuable on e-commerce sites.
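One caveat with patterns like these: /*?sort= only matches when sort is the first parameter, because the pattern requires the literal ? immediately before it. The short check below illustrates this with a hypothetical helper, matches, that translates a wildcard pattern into a regex; the example URLs are invented.

```python
import re


def matches(pattern: str, url_path: str) -> bool:
    """True if a robots.txt wildcard pattern matches from the start of the path."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, url_path) is not None


print(matches("/*?sort=", "/shoes?sort=price"))            # True
print(matches("/*?sort=", "/shoes?color=red&sort=price"))  # False: sort is not first
print(matches("/*&sort=", "/shoes?color=red&sort=price"))  # True
```

Pairing each ? rule with a matching & rule (for example Disallow: /*&sort=) covers the parameter wherever it appears in the query string.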
Yes, but use HTTP authentication or IP allowlisting too, not just robots.txt. Anyone can guess staging.yourdomain.com and read its robots.txt.
robots.txt is a one-time setup that prevents Google from crawling junk URLs, blocks AI scrapers you don't want, and points to your sitemap. Build it once, test it in Search Console, and revisit it whenever your site structure changes. Don't disallow CSS, don't accidentally Disallow: /, and don't use it as a privacy tool. That's the whole game.