I can't open one particular URL with urllib2. The same approach works for other sites, such as "http://www.google.com", but not for this one (which displays fine in a browser).
My simple code:
    from BeautifulSoup import BeautifulSoup
    import urllib2

    url="http://www.experts.scival.com/einstein/"
    response=urllib2.urlopen(url)
    html=response.read()
    soup=BeautifulSoup(html)
    print soup
Can anyone help me get this working?
Here is the error I get:
    Traceback (most recent call last):
      File "/Users/jontaotao/Documents/workspace/MedicalSchoolInfo/src/AlbertEinsteinCollegeOfMedicine_SciValExperts/getlink.py", line 12, in <module>
        response=urllib2.urlopen(url);
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 126, in urlopen
        return _opener.open(url, data, timeout)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
        response = meth(req, response)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
        'http', request, response, code, msg, hdrs)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 432, in error
        result = self._call_chain(*args)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
        result = func(*args)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 619, in http_error_302
        return self.parent.open(new, timeout=req.timeout)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 400, in open
        response = meth(req, response)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 513, in http_response
        'http', request, response, code, msg, hdrs)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 438, in error
        return self._call_chain(*args)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 372, in _call_chain
        result = func(*args)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py", line 521, in http_error_default
        raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    urllib2.HTTPError: HTTP Error 404: Not Found
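Thanks.
I just tried it and got a 404 code and a page back.
At a guess, the site is doing User-Agent detection and, whether by accident or on purpose, is not serving its content to Python's urllib.
To clarify: with urllib, urlopen returned a response object carrying the 404 code and the HTML content. With urllib2.urlopen, a urllib2.HTTPError exception is raised instead.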
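The difference described above can be inspected directly, because the HTTPError raised by urllib2 also behaves like a response object and exposes the status code and the body the server sent. The following is only a minimal sketch: the helper name fetch_status_and_body and its opener parameter are hypothetical, and the import fallback is there so the same code also runs on Python 3.

```python
try:
    # Python 2, as used in the question
    from urllib2 import urlopen, HTTPError
except ImportError:
    # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError


def fetch_status_and_body(url, opener=urlopen):
    """Return (status_code, body), even when the server answers with an error.

    `opener` is a hypothetical hook that defaults to urlopen; it only exists
    so the function can be exercised without touching the network.
    """
    try:
        response = opener(url)
        return response.getcode(), response.read()
    except HTTPError as err:
        # HTTPError doubles as a response: it carries the status code
        # and whatever body the server returned with the error.
        return err.code, err.read()
```

Calling fetch_status_and_body with the URL from the question should return the 404 together with the error page instead of ending in a traceback, which makes it easier to see what the server actually sent.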
I suggest trying to set the User-Agent to something browser-like. There is an existing question about that: Changing user agent on urllib2.urlopen.
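Setting a browser-like User-Agent could look roughly like the sketch below. The User-Agent string, the constant BROWSER_USER_AGENT, and the helper names build_browser_request and fetch_html are all assumptions made for illustration, and whether this particular site then serves the page is a guess rather than something confirmed here.

```python
try:
    # Python 2, as used in the question
    from urllib2 import Request, urlopen
except ImportError:
    # Python 3 equivalents
    from urllib.request import Request, urlopen

# Hypothetical browser-like value; any realistic browser string could be used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_browser_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a Request that sends a browser-like User-Agent header
    instead of the default Python one."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_html(url):
    """Open the URL with the browser-like request and return the raw body."""
    response = urlopen(build_browser_request(url))
    return response.read()
```

The result of fetch_html(url) can be passed to BeautifulSoup exactly as in the question. If the server still answers 404, the User-Agent was probably not the cause.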