The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:
CrawlerProcess
CrawlerRunner
What is the difference between the two? When should I use the "process" and when should I use the "runner"?
Scrapy's documentation does a very poor job of giving real-world examples of either one.
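CrawlerProcess assumes that Scrapy is the only thing using the Twisted reactor. That is not always true, for example if you use threads in Python to run other code. Take this as an example.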
    from scrapy.crawler import CrawlerProcess
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    process = CrawlerProcess()
    process.crawl(MySpider1)
    process.crawl(MySpider2)
    process.start()  # the script will block here until all crawling jobs are finished
    notThreadSafe(3)  # it will get executed when the crawlers stop
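As you can see, the function only gets executed once the crawlers have stopped. What if I want the function to run while the crawlers are still crawling in the same reactor?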
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    import scrapy

    def notThreadSafe(x):
        """do something that isn't thread-safe"""
        # ...

    class MySpider1(scrapy.Spider):
        # Your first spider definition
        ...

    class MySpider2(scrapy.Spider):
        # Your second spider definition
        ...

    runner = CrawlerRunner()
    runner.crawl(MySpider1)
    runner.crawl(MySpider2)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.callFromThread(notThreadSafe, 3)
    reactor.run()  # it will run both crawlers and the code inside the function
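The Runner class is not limited to this functionality. You may also want some custom settings on your reactor (defer, threads, getPage, custom error reporting, etc.).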