I'm trying to fetch a list of articles using a combination of the googlesearch and newspaper3k Python packages. When I call article.parse(), I end up with this error: newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697 on URL https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697
I have tried running the script as admin, and the link works when I open it directly in a browser.
Here is my code:
    import googlesearch
    from newspaper import Article

    query = "trump"
    urlList = []
    for j in googlesearch.search_news(query, tld="com", num=500, stop=200, pause=.01):
        urlList.append(j)
    print(urlList)

    articleList = []
    for i in urlList:
        article = Article(i)
        article.download()
        article.html
        article.parse()
        articleList.append(article.text)
        print(article.text)
Here is my full error output:
    Traceback (most recent call last):
      File "C:/Users/andre/PycharmProjects/StockBot/WebCrawlerTest.py", line 31, in <module>
        article.parse()
      File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line 191, in parse
        self.throw_if_not_downloaded_verbose()
      File "C:\Users\andre\AppData\Local\Programs\Python\Python37\lib\site-packages\newspaper\article.py", line 532, in throw_if_not_downloaded_verbose
        (self.download_exception_msg, self.url))
    newspaper.article.ArticleException: Article `download()` failed with 403 Client Error: Forbidden for url: https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697 on URL https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697
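I expected it to simply output the text of the article. Any help you can provide would be great. Thanks!
I got it working by changing the user agent: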
    from newspaper import Article
    from newspaper import Config

    user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
    config = Config()
    config.browser_user_agent = user_agent

    page = Article("https://www.newsweek.com/donald-trump-hillary-clinton-2020-rally-orlando-1444697", config=config)
    page.download()
    page.parse()
    print(page.text)
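The same fix can probably be carried over to the loop in the question. The following is only a rough sketch: the helper names `fetch_article_texts` and `newspaper_factory` are made up for illustration and are not part of newspaper3k. It assumes, as the traceback above suggests, that a failed download surfaces as an exception when `parse()` is called, and it skips such URLs so that one 403 does not stop the whole run.

```python
def fetch_article_texts(urls, make_article):
    """Download and parse each URL, skipping the ones that fail.

    make_article is any callable that turns a URL into an object with
    download(), parse() and a .text attribute (hypothetical helper, not
    a newspaper3k API).
    """
    texts = {}   # url -> extracted article text
    failed = {}  # url -> error message
    for url in urls:
        article = make_article(url)
        try:
            article.download()
            # With newspaper3k, a failed download (e.g. a 403) is reported
            # here, as an ArticleException raised from parse().
            article.parse()
        except Exception as exc:  # broad on purpose: log and move on
            failed[url] = str(exc)
            continue
        texts[url] = article.text
    return texts, failed


def newspaper_factory(user_agent):
    """Build a make_article callable that sends a browser-like user agent."""
    # Imported lazily so the helper above works without newspaper3k installed.
    from newspaper import Article, Config

    config = Config()
    config.browser_user_agent = user_agent
    return lambda url: Article(url, config=config)
```

With the names from the question, this could be called as `texts, failed = fetch_article_texts(urlList, newspaper_factory(user_agent))`. Some sites may still refuse the request even with a browser-like user agent, in which case the URL simply ends up in `failed`.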