Robots.txt Explained: How to Control Search Engine Crawling

The robots.txt file sits at your site root and tells search engine crawlers which paths they may crawl and which to skip. A misconfigured robots.txt can accidentally block crawlers from your entire site, which can wipe out your search visibility.
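
To make "site root" concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt given any page URL. The helper name robots_url and the shop.yoursite.com subdomain are made up for illustration. The point it shows is that the file is looked up per host, so a subdomain needs its own robots.txt.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the host serving page_url."""
    # A leading slash makes urljoin drop the page's own path and query,
    # keeping only the scheme and host.
    return urljoin(page_url, "/robots.txt")


print(robots_url("https://yoursite.com/blog/post?page=2"))
# https://yoursite.com/robots.txt
print(robots_url("https://shop.yoursite.com/cart"))
# https://shop.yoursite.com/robots.txt
```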

Basic Syntax

User-agent: *
Disallow: /admin/
Disallow: /private/
Sitemap: https://yoursite.com/sitemap.xml

User-agent: * means these rules apply to all crawlers. You can target specific ones (Googlebot, Bingbot) with their own rule groups. A crawler that finds a group naming it follows that group and ignores the * group.
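
As a rough illustration of per-crawler groups, the sketch below feeds a two-group file to Python's standard urllib.robotparser and asks what each crawler may fetch. The /drafts/ path and the allowed helper are assumptions made up for this example. Note how Googlebot, having its own group, is no longer bound by the * group's /admin/ rule.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: one group for Googlebot, one for everyone else.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check a path on the example host for the given crawler name."""
    return parser.can_fetch(agent, "https://yoursite.com" + path)


print(allowed("Googlebot", "/drafts/post"))  # False: Googlebot's own group
print(allowed("Googlebot", "/admin/"))       # True: the * group is ignored
print(allowed("Bingbot", "/admin/"))         # False: falls back to *
print(allowed("Bingbot", "/drafts/post"))    # True
```

If a rule should apply to every crawler, it has to be repeated in each named group.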

Disallow: tells the crawler not to fetch any URL whose path starts with the given value. An empty Disallow: (or a group with no Disallow lines at all) means everything is allowed.
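
One way to sanity-check rules like these before publishing is to run them through Python's standard urllib.robotparser. The sketch below is only an approximation of how a real crawler behaves, and the parse_rules helper and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def parse_rules(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


site = parse_rules("""\
User-agent: *
Disallow: /admin/
Disallow: /private/
Sitemap: https://yoursite.com/sitemap.xml
""")

print(site.can_fetch("*", "https://yoursite.com/admin/users"))    # False
print(site.can_fetch("*", "https://yoursite.com/blog/"))          # True
# Rules match by path prefix, so /administrator is not under /admin/.
print(site.can_fetch("*", "https://yoursite.com/administrator"))  # True
print(site.site_maps())  # Python 3.8+: ['https://yoursite.com/sitemap.xml']

# An empty Disallow allows everything.
open_site = parse_rules("User-agent: *\nDisallow:\n")
print(open_site.can_fetch("*", "https://yoursite.com/anything"))  # True
```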

Common Mistakes

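Disallow: / blocks the entire site. Use it only on purpose, for example on a development or staging copy, and remove it before launch.
Blocking CSS/JS: If you block Googlebot from your CSS or JavaScript files, Google can't render the page the way visitors see it, which can hurt how the page is evaluated and ranked.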
Using robots.txt for security: It's a polite request, not a firewall. Anyone can ignore it. Use authentication, not robots.txt, to protect sensitive data.
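
The first two mistakes are easy to catch automatically. The following is a hedged sketch of a pre-deploy check built on Python's urllib.robotparser. The audit function and the asset paths are hypothetical, and the standard-library parser only does simple prefix matching, so results can differ from Googlebot for rules that use wildcards.

```python
from urllib.robotparser import RobotFileParser


def audit(robots_txt: str,
          asset_paths=("/assets/site.css", "/assets/app.js")) -> list:
    """Return warnings for a blocked homepage or blocked render assets.

    asset_paths are placeholders; pass the CSS/JS paths your pages load.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    warnings = []
    if not parser.can_fetch("Googlebot", "/"):
        warnings.append("homepage is blocked for Googlebot")
    for path in asset_paths:
        if not parser.can_fetch("Googlebot", path):
            warnings.append("render asset blocked: " + path)
    return warnings


print(audit("User-agent: *\nDisallow: /\n"))
print(audit("User-agent: *\nDisallow: /assets/\n"))
print(audit("User-agent: *\nDisallow: /admin/\n"))  # []
```

A check like this cannot cover the third mistake: whether a path is actually protected depends on your server's authentication, not on anything in robots.txt.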

robots.txt is not a security mechanism. It tells well-behaved bots what NOT to crawl. Malicious bots ignore it. Never use it to hide sensitive data.

Best Practices

Always include a Sitemap: directive with the full (absolute) URL of your sitemap.xml
Disallow admin, login, and dashboard paths
Disallow duplicate content paths (print versions, sorted listings)
Never block CSS, JavaScript, or images that are needed for rendering
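
As a rough sketch of how this checklist could be turned into a file, the function below assembles a robots.txt from a list of paths and a sitemap URL. The name build_robots_txt, its parameters, and the example paths are all assumptions made for illustration, and it covers only a single rule group.

```python
def build_robots_txt(disallow, sitemap_url, user_agent="*"):
    """Assemble a single-group robots.txt as a string."""
    for path in disallow:
        if not path.startswith("/"):
            raise ValueError("paths must start with '/': " + repr(path))
    if not sitemap_url.startswith(("http://", "https://")):
        raise ValueError("sitemap URL must be absolute")

    lines = ["User-agent: " + user_agent]
    if disallow:
        lines += ["Disallow: " + path for path in disallow]
    else:
        lines.append("Disallow:")  # empty value: everything is allowed
    lines += ["", "Sitemap: " + sitemap_url]
    return "\n".join(lines) + "\n"


print(build_robots_txt(
    ["/admin/", "/login/", "/print/"],
    "https://yoursite.com/sitemap.xml",
))
```

Rejecting relative paths and relative sitemap URLs up front keeps the most common typos out of the generated file.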

Bottom Line

Every site needs a robots.txt. Keep it simple: allow everything by default, block admin/private paths, include your sitemap URL. Use our generator to create it correctly.