CrawlerProcess vs. CrawlerRunner

小编典典

python

The Scrapy 1.x documentation explains that there are two ways to run a Scrapy spider from a script:

Using CrawlerProcess
Using CrawlerRunner

What is the difference between the two? When should I use the "process" and when should I use the "runner"?

2021-01-20

1 answer

小编典典

Scrapy's documentation does a poor job of giving practical examples of either one.

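CrawlerProcess assumes that Scrapy is the only thing using Twisted's reactor. If you use threads in Python to run other code, that is not always true. Take the following as an example.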

from scrapy.crawler import CrawlerProcess
import scrapy


def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start()  # the script will block here until all crawling jobs are finished
notThreadSafe(3)  # it will get executed when the crawlers stop

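As you can see, the function only runs once the crawlers have stopped. What if I want the function to run while the crawlers are crawling, in the same reactor?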

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
import scrapy


def notThreadSafe(x):
    """do something that isn't thread-safe"""
    # ...


class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...


class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...


runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())  # stop the reactor once both crawls finish
reactor.callFromThread(notThreadSafe, 3)  # schedule the call inside the reactor
reactor.run()  # it will run both crawlers and the code inside the function

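The Runner class is not limited to this functionality. You may want some custom behaviour on your reactor (deferreds, threads, getPage, custom error reporting, etc.).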

2021-01-20