I previously read Zan Hui's "SEO Real Password", which covers robots.txt. Personally I found its treatment quite detailed, but I had never studied how large sites set up their own files. Today it occurred to me to analyze the robots.txt files of the four major domestic microblogging platforms, Sina, Tencent, Sohu, and Netease, and see how their robots rules are written.
1. Sina microblogging
Description: all search engine spiders are allowed.
2. Tencent microblogging
Description: all search engine spiders are allowed, except for some system files. The file also adds two sitemaps: one lists the personal home page addresses of verified members, and the other lists microblog message addresses. The sitemap XML format has limits: a sitemap file may list at most 50,000 URLs and may not be larger than 10 MB; anything beyond that goes into a new sitemap file. I (Solitary Vine) specifically checked Tencent microblogging's first XML sitemap: the file holds about 41,000 URLs and is a little over 2 MB. I will look again after some time to see how Tencent handles the overflow once there are too many URLs for one sitemap.
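The two sitemap limits mentioned above can be sketched in code. This is only a minimal illustration under stated assumptions, not Tencent's actual approach: the function name `split_sitemaps` and the `example.com` URLs are hypothetical, and the 50,000-URL and 10 MB limits are the figures quoted in the text.

```python
from xml.sax.saxutils import escape

MAX_URLS = 50_000              # URL limit per sitemap file, as quoted above
MAX_BYTES = 10 * 1024 * 1024   # 10 MB size limit per file, as quoted above

HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
)
FOOTER = "</urlset>\n"


def split_sitemaps(urls, max_urls=MAX_URLS, max_bytes=MAX_BYTES):
    """Hypothetical helper: return sitemap XML strings that respect both limits."""
    overhead = len((HEADER + FOOTER).encode("utf-8"))
    sitemaps, entries, size = [], [], overhead
    for url in urls:
        entry = f"  <url><loc>{escape(url)}</loc></url>\n"
        entry_size = len(entry.encode("utf-8"))
        # Start a new file when this entry would break either limit.
        if entries and (len(entries) >= max_urls or size + entry_size > max_bytes):
            sitemaps.append(HEADER + "".join(entries) + FOOTER)
            entries, size = [], overhead
        entries.append(entry)
        size += entry_size
    if entries:
        sitemaps.append(HEADER + "".join(entries) + FOOTER)
    return sitemaps


if __name__ == "__main__":
    # Placeholder URLs, roughly the size of the sitemap described above.
    urls = [f"http://example.com/u/{i}" for i in range(41_000)]
    files = split_sitemaps(urls)
    print(len(files), "sitemap file(s)")
```

With about 41,000 short URLs, everything still fits in one file, which matches what I saw in Tencent's first sitemap; a second file would only appear once the list grows past either limit.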
3. Sohu microblogging
Sohu microblogging is the most interesting case. A few months ago a quick keyword ranking technique arose that relied on the high weight of Sohu microblogging, and later it was rumored that Sohu microblogging had blocked the Baidu spider. Let's look at its robots.txt file. The first record allows the Baidu spider to crawl, the second record allows the Sogou spider to crawl, and the third record prohibits all search engine spiders.
Baidu's official documentation says: note that the order of Disallow and Allow lines is significant; a robot decides whether to access a URL based on the first Allow or Disallow line that matches.
Baidu and Sogou each match their own Allow record first, so the final prohibiting record has no effect on them. In other words, Sohu microblogging allows only Baidu and Sogou to crawl its pages.
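The behavior described above can be checked with a small sketch using Python's standard `urllib.robotparser`, which also applies rules in file order. The robots.txt text below is a hypothetical reconstruction of the three records as described, not the real Sohu file, and the user-agent tokens and `example.com` URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical reconstruction of the three records described above;
# the real Sohu file and its exact user-agent tokens are assumptions here.
SOHU_STYLE_ROBOTS = """\
User-agent: Baiduspider
Allow: /

User-agent: Sogou
Allow: /

User-agent: *
Disallow: /
"""

# A single record where line order decides the outcome under first-match.
ORDERED_ROBOTS = """\
User-agent: *
Disallow: /private
Allow: /private/public
"""


def build_parser(text):
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


URL = "http://example.com/some-user"  # placeholder page URL
sohu = build_parser(SOHU_STYLE_ROBOTS)
for agent in ("Baiduspider", "Sogou web spider", "Googlebot", "bingbot"):
    print(agent, "allowed" if sohu.can_fetch(agent, URL) else "blocked")

ordered = build_parser(ORDERED_ROBOTS)
print(ordered.can_fetch("AnySpider", "http://example.com/private/public"))
```

Under these assumptions the Baidu and Sogou spiders come out as allowed and every other spider as blocked. In the second file the Disallow line matches first, so the later Allow line never takes effect; swapping the two lines reverses the result.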
One more point: I found that Sohu microblogging's robots.txt was modified around June to block every search engine other than Baidu and Sogou, yet the other search engines still index it, and their indexed counts keep growing. The difference is that Google, Youdao, and Bing only index the URLs and do not include the page content. Soso does not seem to support the robots file, or something else is going on: it still includes pages, with snapshots and extracted descriptive text. Yahoo also still includes pages, but no snapshot can be seen, so I cannot tell whether it is merely an index.
4. Netease microblogging
No robots.txt file could be found for Netease microblogging.
Let's look at how the four major microblogging platforms are indexed:
Platform / Baidu total indexed / Baidu indexed today (half day) / Note
Sina microblogging / 870