Scrapy 是一套基于基于Twisted的异步处理框架,纯python实现的爬虫框架,用户只需要定制开发几个模块就可以轻松的实现一个爬虫,用来抓取网页内容以及各种图片,非常之方便~
示例代码:
$pip install scrapy $cat > myspider.py <<EOF import scrapy class BlogSpider(scrapy.Spider): name = 'blogspider' start_urls = ['https://blog.scrapinghub.com'] def parse(self, response): for title in response.css('h2.entry-title'): yield {'title': title.css('a ::text').extract_first()} next_page = response.css('div.prev-post > a ::attr(href)').extract_first() if next_page: yield scrapy.Request(response.urljoin(next_page), callback=self.parse) EOF $scrapy runspider myspider.py