PHP-spider


License: GPL
Platform: cross-platform
Language: PHP

Software Overview

An extensible PHP web spider. Example code:

use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;

// Create a spider that starts crawling from the given seed URL.
$spider = new Spider('http://www.oschina.net');
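
The sample imports XPathExpressionDiscoverer, which suggests that links are selected from each fetched page with an XPath expression. As a rough illustration of that idea only, the sketch below uses PHP's built-in DOM extension rather than the library itself. The function name discoverLinks, the sample HTML, and the XPath expression are all assumptions made for this example and are not part of PHP-spider.

```php
<?php
// Hypothetical helper (not part of PHP-spider): return the href values of
// all elements matched by an XPath expression, without duplicates.
function discoverLinks(string $html, string $xpathExpression): array
{
    $document = new DOMDocument();

    // Real-world HTML is rarely valid; collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query($xpathExpression);
    if ($nodes === false) {
        return []; // the expression could not be evaluated
    }

    $links = [];
    foreach ($nodes as $node) {
        if ($node instanceof DOMElement && $node->hasAttribute('href')) {
            $links[] = $node->getAttribute('href');
        }
    }

    // Drop duplicates while keeping first-seen order.
    return array_values(array_unique($links));
}

// Assumed sample page: only links inside <div id="catalogs"> should match.
$html = '<html><body><div id="catalogs"><a href="/a">A</a>'
      . '<a href="/b">B</a><a href="/a">A again</a></div>'
      . '<a href="/c">C</a></body></html>';

$links = discoverLinks($html, "//div[@id='catalogs']//a");
print_r($links); // $links holds '/a' and '/b'; '/c' is outside the div
```

A real discoverer would also need to resolve relative links against the page URI before queueing them; that step is left out here to keep the sketch short.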

Features:

  • supports two traversal algorithms: breadth-first and depth-first

  • supports depth limiting and queue size limiting

  • supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP

  • comes with a useful set of URI filters, such as domain limiting

  • supports custom URI filters, both prefetch (URI) and postfetch (Resource content)

  • supports custom request handling logic

  • comes with a useful set of persistence handlers (memory and file; Redis soon to follow)

  • supports custom persistence handlers

  • collects statistics about the crawl for reporting

  • dispatches useful events, allowing developers to add even more custom behavior

  • supports a politeness policy

  • will soon come with many default discoverers: RSS, Atom, RDF, etc.

  • will soon support multiple queueing mechanisms (file, memcache, redis)

  • will eventually support distributed spidering with a central queue
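
To make the first two features more concrete, here is a minimal sketch of how breadth-first and depth-first traversal differ, and how a depth limit and a queue size limit cut a crawl short. It runs on a small in-memory link graph instead of real HTTP requests. The function crawlOrder, its parameters, and the sample graph are hypothetical and do not reflect PHP-spider's actual implementation; in particular, treating the queue size limit as a cap on the total number of URIs ever queued is an assumption of this sketch.

```php
<?php
// Hypothetical sketch (not PHP-spider's code). $graph maps each URI to the
// URIs that would be discovered on that page.
function crawlOrder(
    array $graph,
    string $seed,
    string $algorithm,
    int $maxDepth,
    int $maxQueueSize
): array {
    $frontier = [[$seed, 0]]; // pending [uri, depth] pairs
    $seen = [$seed => true];  // every URI queued so far
    $visited = [];

    while ($frontier !== []) {
        // Breadth-first takes the oldest entry (FIFO),
        // depth-first takes the newest entry (LIFO).
        [$uri, $depth] = $algorithm === 'breadth-first'
            ? array_shift($frontier)
            : array_pop($frontier);
        $visited[] = $uri;

        if ($depth >= $maxDepth) {
            continue; // depth limit: do not follow links from this page
        }

        foreach ($graph[$uri] ?? [] as $next) {
            if (count($seen) >= $maxQueueSize) {
                break; // queue size limit: accept no more URIs
            }
            if (!isset($seen[$next])) {
                $seen[$next] = true;
                $frontier[] = [$next, $depth + 1];
            }
        }
    }

    return $visited;
}

// Assumed sample site structure.
$graph = [
    '/'  => ['/a', '/b'],
    '/a' => ['/a1', '/a2'],
    '/b' => ['/b1'],
];

// Visits: /, /a, /b, /a1, /a2, /b1
echo implode(' ', crawlOrder($graph, '/', 'breadth-first', 2, 10)), "\n";
// Visits: /, /b, /b1, /a, /a2, /a1
echo implode(' ', crawlOrder($graph, '/', 'depth-first', 2, 10)), "\n";
// Depth limit of 1 stops after the seed's direct links: /, /a, /b
echo implode(' ', crawlOrder($graph, '/', 'breadth-first', 1, 10)), "\n";
```

The only difference between the two algorithms here is which end of the pending list the next URI is taken from. A politeness policy would typically add a delay between requests to the same host inside the loop.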