I'm trying to scrape this page using the python-requests library:
import requests
from lxml import etree, html

url = 'http://www.amazon.in/b/ref=sa_menu_mobile_elec_all?ie=UTF8&node=976419031'
r = requests.get(url)
tree = etree.HTML(r.text)
print tree
but I get the above error (TooManyRedirects). I tried using the allow_redirects parameter, but got the same error:
r = requests.get(url, allow_redirects=True)
I even tried sending headers and data along with the URL, but I'm not sure whether that's the right approach:
headers = {'content-type': 'text/html'}
payload = {'ie': 'UTF8', 'node': '976419031'}
r = requests.post(url, data=payload, headers=headers, allow_redirects=True)
How can I resolve this error? Out of curiosity I also tried BeautifulSoup, but got a different yet similar error:
page = BeautifulSoup(urllib2.urlopen(url))
urllib2.HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop. The last 30x error message was: Moved Permanently
Martijn Piet.. 25
Amazon is redirecting your request for http://www.amazon.in/b?ie=UTF8&node=976419031 to http://www.amazon.in/electronics/b?ie=UTF8&node=976419031, which then redirects back to the first URL, after which you are in a loop:
>>> loc = url
>>> seen = set()
>>> while True:
...     r = requests.get(loc, allow_redirects=False)
...     loc = r.headers['location']
...     if loc in seen: break
...     seen.add(loc)
...     print loc
...
http://www.amazon.in/b?ie=UTF8&node=976419031
http://www.amazon.in/electronics/b?ie=UTF8&node=976419031
>>> loc
http://www.amazon.in/b?ie=UTF8&node=976419031
So your original URL A redirects to a new URL B, which redirects to C, which redirects back to B, and so on.
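The REPL loop above can be distilled into a small helper that walks a chain of Location headers and reports the first cycle. This is only an illustrative sketch (the function name and the injected fetch_location callable are hypothetical, not part of the answer); the callable abstraction lets you test the loop-detection logic without touching Amazon:

```python
def find_redirect_cycle(start, fetch_location, max_hops=30):
    """Follow redirects via fetch_location (a function mapping a URL to
    the next Location header, or None when there is no redirect).
    Return the list of URLs visited up to and including the first
    repeated URL, or None if the chain terminates normally.
    Hypothetical helper distilled from the REPL loop in the answer."""
    seen = []
    loc = start
    for _ in range(max_hops):
        if loc in seen:
            seen.append(loc)  # include the repeat so the cycle is visible
            return seen
        seen.append(loc)
        loc = fetch_location(loc)
        if loc is None:  # no Location header: the chain ends here
            return None
    return None

# Simulated redirect table reproducing the A -> B -> C -> B loop:
table = {'A': 'B', 'B': 'C', 'C': 'B'}
print(find_redirect_cycle('A', table.get))  # ['A', 'B', 'C', 'B']
```

Against a live server you would pass something like `lambda u: requests.get(u, allow_redirects=False).headers.get('location')` as fetch_location.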
Apparently Amazon does this based on the User-Agent header, at which point it sets a cookie that subsequent requests are expected to send back. The following works:
>>> s = requests.Session()
>>> s.headers['User-Agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> r = s.get(url)
>>> r
<Response [200]>
This creates a session (for ease of reuse and cookie persistence) and sets a copy of a Chrome user-agent string on it. The request then succeeds (a 200 response is returned).
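The same approach can be packaged as a small factory function for reuse. This is a sketch only (the make_browser_session name is hypothetical; the answer builds the session inline), and the actual s.get(url) call is shown as a comment since it needs network access:

```python
import requests

# Browser user-agent string, copied from the answer above.
UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/34.0.1847.131 Safari/537.36')

def make_browser_session(user_agent=UA):
    """Return a requests.Session that presents a browser User-Agent,
    so Amazon sets its cookie instead of redirecting in a loop.
    (Hypothetical helper name, not part of the original answer.)"""
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    return s

session = make_browser_session()
# session.get(url) should now return a 200 response instead of raising
# TooManyRedirects, and the cookie persists for later requests:
#   tree = etree.HTML(session.get(url).text)
```

Because the cookie lives on the session, any further `session.get(...)` calls reuse it automatically.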