Early on a weekend morning I received an alert email. My first guess was that the site was being attacked, or that something was wrong with the cache, the logs, or memory. One look at access.log showed what had actually happened: during that window a wave of bots (a bot being an automated program that performs the same task over and over again) had been hitting the site.
website.com (AWS) - Monitor is Down
Down since Mar 25, 2017 1:38:58 AM CET
Site Monitored: http://www.website.com
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
Monitor Group: XX Applications
Outage Details
Location: London - UK (5.77.35.xx)
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
Response headers:
    HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
    Content-Length: 0
    Connection: keep-alive
Request:
    GET / HTTP/1.1
    Cache-Control: no-cache
    Accept: */*
    Connection: Keep-Alive
    Accept-Encoding: gzip
    User-Agent: Site24x7
    Host: xxx

Location: Seattle - US (104.140.20.xx)
Resolved IP: 54.171.32.xx
Reason: Service Unavailable.
(response headers and request identical to the London probe above)
A quick web search showed that many webmasters have hit the same problem: a short burst of intensive bot traffic creates a spike that leaves the server unable to serve any other clients. Drawing on the analysis in that article, there are several ways to block these web bots.
1. robots.txt
Many crawlers fetch robots.txt first, as these log entries show:
"199.58.86.206" - - [25/Mar/2017:01:26:50 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" "199.58.86.206" - - [25/Mar/2017:01:26:54 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)" "162.210.196.98" - - [25/Mar/2017:01:39:18 +0000] "GET /robots.txt HTTP/1.1" 404 341 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"
Many bot operators also document what to do if you don't want to be crawled. Take MJ12bot as an example:
How can I block MJ12bot?
MJ12bot adheres to the robots.txt standard. If you want to prevent your website from being crawled, add the following text to your robots.txt:
User-agent: MJ12bot
Disallow: /
Please do not waste your time trying to block the bot by IP in .htaccess - we do not use any consecutive IP blocks, so your efforts will be in vain. Also please make sure the bot can actually retrieve robots.txt itself - if it can't, it will assume (this is industry practice) that it's okay to crawl your site.
If you have reason to believe that MJ12bot did NOT obey your robots.txt commands, then please let us know via email: bot@majestic12.co.uk. Please provide the URL of your website and log entries showing the bot trying to retrieve pages that it was not supposed to.
How can I slow down MJ12bot?
You can easily slow the bot down by adding the following to your robots.txt file:
User-Agent: MJ12bot
Crawl-Delay: 5
Crawl-Delay should be an integer and signifies the number of seconds to wait between requests. MJ12bot will delay up to 20 seconds between requests to your site - note, however, that while unlikely, it is still possible your site is crawled by multiple MJ12bots at the same time. Setting a high Crawl-Delay should minimise the impact on your site. The Crawl-Delay parameter is also honoured if it was set for the * wildcard.
If our bot detects that you used Crawl-Delay for any other bot then it will automatically crawl slower even though MJ12bot specifically was not asked to do so.
Given that, we can write a robots.txt like this:
User-agent: YisouSpider
Disallow: /
User-agent: EasouSpider
Disallow: /
User-agent: EtaoSpider
Disallow: /
User-agent: MJ12bot
Disallow: /
In addition, given that many bots like to probe these paths:
/wp-login.php
/wp-admin/
/trackback/
/?replytocom=
…
Many WordPress sites genuinely use these directories, so how can we adjust robots.txt without breaking functionality? Compare the file before and after the change:
robots.txt before:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-content/plugins
Disallow: /wp-content/themes
Disallow: /wp-includes
Disallow: /?s=

robots.txt after:

User-agent: *
Disallow: /wp-admin
Disallow: /wp-*
Allow: /wp-content/uploads/
Disallow: /wp-content
Disallow: /wp-login.php
Disallow: /comments
Disallow: /wp-includes
Disallow: /*/trackback
Disallow: /*?replytocom*
Disallow: /?p=*&preview=true
Disallow: /?s=
That said, plenty of crawlers pay no attention to robots.txt. Take this one as an example - it never requested robots.txt before crawling:
"10.70.8.30, 163.172.65.40" - - [25/Mar/2017:02:13:36 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:13:42 +0000] "GET / HTTP/1.1" 200 129989 "-" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/js/utils.js HTTP/1.1" 200 5345 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)" "178.63.23.67, 163.172.65.40" - - [25/Mar/2017:02:14:17 +0000] "GET /static/css/home.css HTTP/1.1" 200 8511 "http://iatatravelcentre.com/" "Mozilla/5.0 (compatible; AhrefsBot/5.2; +http://ahrefs.com/robot/)"
When that happens, it's time to try the other methods.
2. .htaccess
The idea is URL rewriting: as soon as a request is found to come from one of these agents, it is denied. This article by "~吉尔伽美什" covers many uses of .htaccess; the relevant sections (numbered 5-10 in the original) are reproduced below.
5. Blocking users by IP

order allow,deny
deny from 123.45.6.7
# an entire class-C block:
deny from 12.34.5.
allow from all

6. Blocking users/sites by referrer (requires mod_rewrite)

Example 1. Block a single referrer: badsite.com

RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite\.com [NC]
RewriteRule .* - [F]

Example 2. Block multiple referrers: badsite1.com, badsite2.com

RewriteEngine on
# Options +FollowSymlinks
RewriteCond %{HTTP_REFERER} badsite1\.com [NC,OR]
RewriteCond %{HTTP_REFERER} badsite2\.com
RewriteRule .* - [F]

[NC] - case-insensitive
[F] - 403 Forbidden

Note that the code above comments out "Options +FollowSymlinks". If the server does not set FollowSymLinks in the <Directory> section of httpd.conf, that line must be enabled, otherwise you will get a "500 Internal Server Error".

7. Blocking bad bots and site rippers (aka offline browsers) (requires mod_rewrite)

Bad bots? For example, crawlers that harvest email addresses for spam, and crawlers that ignore robots.txt (Baidu?). They can be identified by HTTP_USER_AGENT. (Some are more shameless still - "中搜 zhongsou.com", for one, disguises its agent as "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)", and against those nothing can be done.)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]

[F] - 403 Forbidden
[L] - last rule (stop processing further rules)

8. Change your default directory page

DirectoryIndex index.html index.php index.cgi index.pl

9. Redirects

A single file:

Redirect /old_dir/old_file.html http://yoursite.com/new_dir/new_file.html

An entire directory:

Redirect /old_dir http://yoursite.com/new_dir

Effect: the same as moving the directory:
http://yoursite.com/old_dir -> http://yoursite.com/new_dir
http://yoursite.com/old_dir/dir1/test.html -> http://yoursite.com/new_dir/dir1/test.html

Tip: making Redirect work under a user directory. When you use Apache's default user directories, e.g. http://mysite.com/~windix, and want to redirect http://mysite.com/~windix/jump, you will find that this Redirect does not work:

Redirect /jump http://www.google.com

The correct way is to change it to:

Redirect /~windix/jump http://www.google.com

(source: .htaccess Redirect in "Sites" not redirecting: why?)

10. Prevent viewing of .htaccess file

<Files .htaccess>
order allow,deny
deny from all
</Files>
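Coming back to this incident: combining the user-agent technique from section 7 with the bots actually seen in my logs, a minimal .htaccess sketch could look like this (assuming mod_rewrite is enabled; MJ12bot and AhrefsBot are simply the agents from my access.log, so treat the list as illustrative, not exhaustive):

RewriteEngine On
# match any request whose user agent contains one of the observed bot names
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AhrefsBot [NC]
# respond with 403 Forbidden and stop processing further rules
RewriteRule .* - [F,L]

Unlike the ^-anchored patterns above, these match anywhere in the user-agent string, which also catches agents such as "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; ...)" that do not begin with the bot name.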
3. Denying access by IP
You can deny access from specific IPs in the Apache configuration file httpd.conf:
Order allow,deny
Allow from all
Deny from 5.9.26.210
Deny from 162.243.213.131
But these IPs are rarely fixed, which makes this approach awkward, and any change to httpd.conf requires an Apache restart to take effect, so modifying .htaccess instead is recommended.
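For completeness: on Apache 2.4 the Order/Allow/Deny directives have been superseded by mod_authz_core's Require syntax, and the same rules can live in .htaccess, where they apply without a restart. A minimal sketch using the same two (placeholder) addresses from the example above:

<RequireAll>
    # allow everyone by default
    Require all granted
    # except the offending addresses seen in access.log
    Require not ip 5.9.26.210
    Require not ip 162.243.213.131
</RequireAll>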