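I'm trying to download a PDF, but I get the following error: HTTP Error 403: Forbidden
I know the server is blocking the request for some reason, but I can't seem to find a solution.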
import urllib.request
import urllib.parse
import requests

def download_pdf(url):
    full_name = "Test.pdf"
    urllib.request.urlretrieve(url, full_name)

try:
    url = ('http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf')
    print('initialized')
    hdr = {}
    hdr = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
        'Content-Length': '136963',
    }
    print('HDR recieved')
    req = urllib.request.Request(url, headers=hdr)
    print('Header sent')
    resp = urllib.request.urlopen(req)
    print('Request sent')
    respData = resp.read()
    download_pdf(url)
    print('Complete')
except Exception as e:
    print(str(e))
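You seem to have realised this already: the remote server is apparently checking the User-Agent header and rejecting requests from Python's urllib. Your code sets the header on the Request object, but the download itself goes through urllib.request.urlretrieve(), which does not let you change the HTTP headers. You can, however, use urllib.request.URLopener.retrieve():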
import urllib.request

opener = urllib.request.URLopener()
opener.addheader('User-Agent', 'whatever')
filename, headers = opener.retrieve(url, 'Test.pdf')
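N.B. You are using Python 3, where these functions are considered part of the "Legacy interface", and URLopener has been deprecated. For that reason you should not use them in new code.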
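If you would rather stay within urllib and avoid the legacy interface, a minimal sketch is to pass a Request that carries the header to urlopen() and copy the response to a file yourself. The helper names build_request and download and the shortened User-Agent value are assumptions made for illustration; whether this particular server accepts that value has not been verified.

```python
import shutil
import urllib.request

# Assumption: a generic browser-like value; the string the server
# actually expects is not known.
USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest, user_agent=USER_AGENT):
    """Stream the response body to dest and return dest.

    urlopen() raises urllib.error.HTTPError if the server answers
    with an error status such as 403.
    """
    req = build_request(url, user_agent)
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

With the url from the question, the call would be download(url, 'Test.pdf'). Unlike urlretrieve(), the header set on the Request is the one that is sent with the download itself.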
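That aside, you are going to a lot of trouble simply to access a URL. Your code imports requests but never uses it. You should, because it is much easier than urllib. This works for me: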
import requests

url = 'http://papers.xtremepapers.com/CIE/Cambridge%20IGCSE/Mathematics%20(0580)/0580_s03_qp_1.pdf'
r = requests.get(url)
with open('0580_s03_qp_1.pdf', 'wb') as outfile:
    outfile.write(r.content)
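One caveat with the snippet above is that it writes r.content to disk whatever the status code, so a 403 error page would be saved under a .pdf name. A slightly more defensive sketch checks the status first and sends an explicit User-Agent in case the server starts rejecting the default one. The function name save_pdf, the optional session parameter, the header value and the timeout are all assumptions made for illustration, not something the server is known to require.

```python
import requests

# Assumption: a generic browser-like value, only needed if the server
# rejects the default python-requests User-Agent.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def save_pdf(url, dest, session=None, timeout=30):
    """Download url to dest, raising on an HTTP error status.

    session may be a requests.Session (or any object with a compatible
    get method); the module-level requests.get is used when it is None.
    """
    http = session if session is not None else requests
    r = http.get(url, headers=HEADERS, timeout=timeout)
    r.raise_for_status()  # raises requests.HTTPError on 403 and other 4xx/5xx
    with open(dest, "wb") as outfile:
        outfile.write(r.content)
    return dest
```

With the url defined above, save_pdf(url, '0580_s03_qp_1.pdf') would either write the file or raise, instead of silently saving an error page.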