# robots.txt

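# What is robots.txt

Official site for the robots.txt convention: https://www.robotstxt.org/

robots.txt is a plain-text file in a website's root directory that tells web crawlers which parts of the site may be crawled and which may not.

Examples: Taobao: https://taobao.com/robots.txt, Tencent: https://www.qq.com/robots.txt

robots.txt is only a voluntary convention. Crawlers from Google, Baidu, Bing, and others honor it, but not every search engine does, so robots.txt cannot guarantee that blocked content stays uncrawled.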

# How to use robots.txt

Allow all crawlers:

User-agent: *
Allow: /

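Allow only a specific crawler (replace name_spider with the crawler's real name; the links in the appendix list crawler names). The catch-all group is what turns every other crawler away:

User-agent: name_spider
Allow: /

User-agent: *
Disallow: /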
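
A minimal sketch of how a well-behaved crawler would read such rules, using Python's standard `urllib.robotparser`. It assumes the file also contains a catch-all group that disallows everyone else (without one, other crawlers are not restricted at all). `name_spider` and `other_spider` are placeholder names, and example.com is a hypothetical host.

```python
from urllib.robotparser import RobotFileParser

# Rules that allow only name_spider; the catch-all group blocks the rest.
RULES = """\
User-agent: name_spider
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


print(is_allowed(RULES, "name_spider", "https://example.com/page.html"))   # True
print(is_allowed(RULES, "other_spider", "https://example.com/page.html"))  # False
```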

Block all crawlers:

User-agent: *
Disallow: /

Block all crawlers from specific directories:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
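
The directory rules above can be checked with a short sketch based on Python's standard `urllib.robotparser`. The bot name `AnyBot`, the host example.com, and the sample paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# Paths are matched by prefix, so everything under a listed directory is blocked.
for path in ("/private/report.html", "/images/logo.png", "/blog/post.html"):
    allowed = parser.can_fetch("AnyBot", "https://example.com" + path)
    print(path, "allowed" if allowed else "blocked")
```

Only `/blog/post.html` is reported as allowed, because it sits outside every listed directory.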

Block only a badly behaved crawler from a specific directory (replace BadBot with the crawler's real name):

User-agent: BadBot
Disallow: /private/

Block all crawlers from specific file types:

User-agent: *
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
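
As far as I can tell, Python's `urllib.robotparser` matches paths by plain prefix and does not interpret `*` or `$`, so the sketch below translates the patterns to regular expressions by hand. It assumes the common reading of the two characters (`*` matches any run of characters, a trailing `$` anchors the end of the path); `pattern_to_regex` and `is_blocked` are illustrative helper names, not part of any library.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with ".*" for each "*".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_blocked(path: str, disallow_patterns: list[str]) -> bool:
    """Return True if `path` matches any Disallow pattern (matched from the start)."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns)


DISALLOW = ["/*.php$", "/*.js$", "/*.inc$", "/*.css$"]

print(is_blocked("/index.php", DISALLOW))       # True
print(is_blocked("/index.php?id=1", DISALLOW))  # False: "$" requires the path to end in .php
print(is_blocked("/static/app.js", DISALLOW))   # True
print(is_blocked("/about.html", DISALLOW))      # False
```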

Allow all crawlers (an empty Disallow value blocks nothing):

User-agent: *
Disallow:

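# Non-standard extensions

# Sitemap

All major search engines currently support Sitemap. A sitemap tells crawlers which URLs a site contains.

Usage (add this line to robots.txt; the value should be the full URL of the sitemap):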

Sitemap: <path-to-sitemap.xml>
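
A small sketch of how a crawler could read the Sitemap line with Python's standard `urllib.robotparser` (`site_maps()` needs Python 3.8 or later). The robots.txt content and the sitemap URL `https://z.wiki/sitemap.xml` are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the sitemap URL is an assumed location.
RULES = """\
User-agent: *
Disallow:

Sitemap: https://z.wiki/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of sitemap URLs, or None if the file declares none.
print(parser.site_maps())  # ['https://z.wiki/sitemap.xml']
```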

The sitemap file has the following format:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
    <url>
        <loc>https://z.wiki/</loc>
        <lastmod>2022-04-16T12:42:45.000Z</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
        <loc>https://z.wiki/life/</loc>
        <lastmod>2022-02-05T14:55:06.000Z</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
        <loc>https://z.wiki/life/bento.html</loc>
        <lastmod>2022-03-28T14:56:49.000Z</lastmod>
        <changefreq>daily</changefreq>
    </url>
</urlset>
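
To show how the entries in such a file can be read, here is a sketch using Python's standard `xml.etree.ElementTree`. The embedded XML is a shortened copy of the sample above with only the default namespace kept, and `list_urls` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Namespace used by the <urlset>, <url>, <loc>, ... elements.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://z.wiki/</loc>
        <lastmod>2022-04-16T12:42:45.000Z</lastmod>
        <changefreq>daily</changefreq>
    </url>
    <url>
        <loc>https://z.wiki/life/</loc>
        <lastmod>2022-02-05T14:55:06.000Z</lastmod>
        <changefreq>daily</changefreq>
    </url>
</urlset>
"""


def list_urls(xml_text: str) -> list[tuple[str, str]]:
    """Return (loc, lastmod) pairs for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [
        (url.findtext("sm:loc", default="", namespaces=NS),
         url.findtext("sm:lastmod", default="", namespaces=NS))
        for url in root.findall("sm:url", NS)
    ]


for loc, lastmod in list_urls(SITEMAP_XML):
    print(loc, lastmod)
```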

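# Crawl-delay directive

The Crawl-delay parameter sets the interval between a crawler's requests so that crawling does not hurt server performance. It is a non-standard directive, so not every crawler honors it.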

User-agent: *
Crawl-delay: 10
# Wait 10 seconds after each fetch before crawling the next link
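
A sketch of how a polite crawler might pick up this value with Python's standard `urllib.robotparser`; the bot name `AnyBot` is made up.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# crawl_delay() returns None when no Crawl-delay applies to the agent.
delay = parser.crawl_delay("AnyBot")
print(delay)  # 10
```

A crawler would then pause between requests, for example with `time.sleep(delay or 0)`.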

# Alternative 1

robots.txt is the most widely used method. In addition, the robots meta tag can set rules for an individual page.

<head>
	<meta name="robots" content="noindex,nofollow" />
</head>

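The values mean the following:

| content value | Meaning |
| --- | --- |
| all | No restrictions on indexing or on showing content. This is the default, so listing it explicitly has no effect. |
| noindex | Do not show this page, media, or resource in search results. If the rule is omitted, the page, media, or resource may be indexed and shown in search results. |
| nofollow | Do not follow the links on this page. |
| none | Equivalent to noindex, nofollow. |
| noarchive | Do not show a cached copy in search results. |
| nositelinkssearchbox | Do not show a sitelinks search box for this page in search results. |
| nosnippet | Do not show a text snippet or video preview for this page in search results. |
| indexifembedded | Search engines may index the page's content when it is embedded in another page through iframes or similar HTML tags. |
| unavailable_after: [date/time] | Do not show this page in search results after the specified date/time. |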
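
The sketch below shows one way a crawler could extract these values from a page, using Python's standard `html.parser`. `RobotsMetaParser` and `robots_directives` are illustrative names, and the handling is deliberately simple: it only looks at `<meta name="robots">` and splits the content on commas.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set[str]:
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = '<head><meta name="robots" content="noindex,nofollow" /></head>'
print(sorted(robots_directives(PAGE)))  # ['nofollow', 'noindex']
```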

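# Alternative 2

Besides the robots meta tag, the same rules can be sent in the X-Robots-Tag HTTP response header. In the example below, noindex keeps the page out of search results; the other values are the same as the content values of the robots meta tag:

HTTP/1.1 200 OK
Date: Tue, 25 May 2010 21:42:43 GMT
...
X-Robots-Tag: noindex
...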
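
A minimal sketch of reading this header on the crawler side. It assumes a simple comma-separated value such as `noindex, nofollow` and does not handle bot-specific prefixes; `x_robots_directives` is an illustrative helper, and the headers are passed in as a plain dictionary rather than fetched over the network.

```python
def x_robots_directives(headers: dict[str, str]) -> set[str]:
    """Return the directives of an X-Robots-Tag header, lower-cased.

    Assumes a plain comma-separated value; bot-specific prefixes
    (for example "googlebot: noindex") are not handled here.
    """
    for name, value in headers.items():
        if name.lower() == "x-robots-tag":  # header names are case-insensitive
            return {part.strip().lower() for part in value.split(",") if part.strip()}
    return set()


RESPONSE_HEADERS = {
    "Date": "Tue, 25 May 2010 21:42:43 GMT",
    "X-Robots-Tag": "noindex",
}

directives = x_robots_directives(RESPONSE_HEADERS)
print("noindex" in directives)  # True
```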

# Case studies

# Tencent

Tencent's official site is https://www.tencent.com, and its robots.txt is at https://www.tencent.com/robots.txt, with the following content:

User-agent: *
Disallow:

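This configuration sets no disallow rule for any crawler; that is, any crawler may crawl the entire Tencent site. How can we verify that crawlers actually did crawl it? This is where the site: search operator comes in handy.

Searching Baidu for 腾讯 site:www.tencent.com (腾讯 is "Tencent") returns plenty of content from the Tencent site, as shown in the figure below.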

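# Taobao

On September 8, 2008, Taobao formally declared war on Baidu: Taobao would block Baidu's search engine from crawling it.

Source: https://www.guayunfan.com/baike/305946.html

Taobao has many technical means of blocking Baidu's crawler, but here we only discuss robots.txt. Taobao's robots.txt reads as follows: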

User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /
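
To see how a compliant crawler would interpret the rules quoted above, here is a sketch using Python's standard `urllib.robotparser`. The agent name `bingbot` and the test URL are chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The Taobao rules quoted above.
RULES = """\
User-agent: Baiduspider
Disallow: /

User-agent: baiduspider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

url = "https://taobao.com/"
print(parser.can_fetch("Baiduspider", url))  # False
print(parser.can_fetch("bingbot", url))      # True: no group applies to bingbot
```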

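Only Baidu's crawler is blocked. Interesting!

Try searching Baidu for something sold on Taobao, for example 手机 ("mobile phone"). The results:

Overall the block works well: none of the results are under the taobao.com domain. There is, however, content under cpcwi.taobao.com. Did something slip through the net? Here is the robots.txt of that subdomain: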

User-agent: *
Disallow: /

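Ahem, cpcwi.taobao.com blocks all crawlers. Is Baidu's crawler misbehaving? Compare the results of searching 手机 site:cpcwi.taobao.com on Baidu and on Bing.

Ha, so it really is Baidu misbehaving 😒😒😒

Back to Taobao: it blocks Baidu specifically but no other crawler, so searching Bing for 手机 site:taobao.com should return results. Let's take a look:

Sure enough, it does.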

# Appendix

Crawler list: https://www.robotstxt.org/db.html
Common crawlers in China (see https://www.baidu.com/robots.txt):

Baidu: Baiduspider
Google: Googlebot
Microsoft: MSNBot
Baidu Images: Baiduspider-image
Youdao: YoudaoBot
Sogou: Sogou web spider
Sogou Academic: Sogou inst spider
Sogou: Sogou spider2
Sogou Blog: Sogou blog
Sogou News: Sogou News Spider
Sogou: Sogou Orion spider
Chinaso: ChinasoSpider
Soso: Sosospider
Yisou: yisouspider
Easou: EasouSpider
