
scrapy - parsing paginated items

python

I have a URL of the form:

example.com/foo/bar/page_1.html

There are 53 pages in total, and each page has about 20 rows.

I basically want to get all the rows from all the pages, i.e. ~53 * 20 items.

I have working code in my parse method that parses a single page; for each item it also follows one page deeper to get more information about that item:

  def parse(self, response):
    hxs = HtmlXPathSelector(response)

    restaurants = hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]')

    for rest in restaurants:
      item = DegustaItem()
      item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
      # some items don't have category associated with them
      try:
        item['category'] = rest.select('td[3]/a/text()').extract()[0]
      except IndexError:
        item['category'] = ''
      item['urbanization'] = rest.select('td[4]/a/text()').extract()[0]

      # get profile url
      rel_url = rest.select('td[2]/a/@href').extract()[0]
      # join with base url since profile url is relative
      base_url = get_base_url(response)
      follow = urljoin_rfc(base_url, rel_url)

      request = Request(follow, callback=self.parse_profile)
      request.meta['item'] = item
      return request


  def parse_profile(self, response):
    item = response.meta['item']
    # item['address'] = figure out xpath
    return item

The question is, how do I crawl each of these pages?

example.com/foo/bar/page_1.html
example.com/foo/bar/page_2.html
example.com/foo/bar/page_3.html
...
...
...
example.com/foo/bar/page_53.html


1 Answer


You have two options to solve your problem. The general one is to use yield instead of return to generate new requests; that way you can issue more than one new request from a single callback. Check the second example at http://doc.scrapy.org/en/latest/topics/spiders.html#basespider-example.
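
A minimal sketch of that first option, assuming the same imports, item class (DegustaItem) and parse_profile callback as the question's code; reading the current page number back out of response.url is my own illustration, not something from the original:

def parse(self, response):
    hxs = HtmlXPathSelector(response)

    for rest in hxs.select('//*[@id="contenido-resbus"]/table/tr[position()>1]'):
        item = DegustaItem()
        item['name'] = rest.select('td[2]/a/b/text()').extract()[0]
        rel_url = rest.select('td[2]/a/@href').extract()[0]
        follow = urljoin_rfc(get_base_url(response), rel_url)
        request = Request(follow, callback=self.parse_profile)
        request.meta['item'] = item
        yield request  # yield instead of return: the loop keeps emitting requests

    # from the same callback, also queue the next listing page
    page = int(response.url.rsplit('page_', 1)[1].split('.')[0])
    if page < 53:
        yield Request('http://example.com/foo/bar/page_%d.html' % (page + 1),
                      callback=self.parse)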

In your case there is probably a simpler solution: just generate the list of start URLs from the pattern, like this:

class MySpider(BaseSpider):
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page for page in xrange(1,54)]
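
Note that xrange and BaseSpider are Python 2 / old-Scrapy names; on Python 3 with a current Scrapy release the same idea looks like this (the spider name is made up for the example):

import scrapy

class MySpider(scrapy.Spider):
    name = 'degusta'  # hypothetical name, required by current Scrapy
    start_urls = ['http://example.com/foo/bar/page_%s.html' % page
                  for page in range(1, 54)]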