An extensible PHP web spider. Example code:
use VDB\Spider\Spider;
use VDB\Spider\Discoverer\XPathExpressionDiscoverer;

$spider = new Spider('http://www.oschina.net');
Features:
supports two traversal algorithms: breadth-first and depth-first
supports depth limiting and queue size limiting
supports adding custom URI discovery logic, based on XPath, CSS selectors, or plain old PHP
comes with a useful set of URI filters, such as domain limiting
supports custom URI filters, both prefetch (URI) and postfetch (Resource content)
supports custom request handling logic
comes with a useful set of persistence handlers (memory and file, with Redis soon to follow)
supports custom persistence handlers
collects statistics about the crawl for reporting
dispatches useful events, allowing developers to add even more custom behavior
supports a politeness policy
will soon come with many default discoverers: RSS, Atom, RDF, etc.
will soon support multiple queueing mechanisms (file, memcache, redis)
will eventually support distributed spidering with a central queue
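
To make the first two features more concrete, here is a minimal standalone sketch of how breadth-first and depth-first traversal differ, and how a depth limit and a queue size limit cut a crawl short. It does not use the VDB\Spider API: the function crawlOrder, the $links graph, and the $maxDepth and $maxQueued parameters are all hypothetical names made up for illustration, and the link graph stands in for pages that a real spider would fetch.

```php
<?php
// Hypothetical in-memory link graph: each key is a page, each value lists
// the pages it links to. A real spider would discover these by fetching.
$links = [
    'a' => ['b', 'c'],
    'b' => ['d'],
    'c' => ['e'],
    'd' => [],
    'e' => [],
];

/**
 * Return the order in which pages are visited, starting from $start.
 *
 * $mode      'bfs' (breadth-first) or 'dfs' (depth-first)
 * $maxDepth  links are not followed from pages at this depth or deeper
 * $maxQueued upper bound on the total number of URIs ever queued,
 *            counting the start URI (an assumption of this sketch)
 */
function crawlOrder(array $links, string $start, string $mode, int $maxDepth, int $maxQueued): array
{
    $frontier = [[$start, 0]];   // pending [uri, depth] pairs
    $seen = [$start => true];    // every URI that has been queued so far
    $visited = [];

    while ($frontier !== []) {
        // Breadth-first takes the oldest entry (FIFO),
        // depth-first takes the newest entry (LIFO).
        [$uri, $depth] = $mode === 'bfs' ? array_shift($frontier) : array_pop($frontier);
        $visited[] = $uri;

        if ($depth >= $maxDepth) {
            continue; // depth limit reached: do not follow links from here
        }

        $children = $links[$uri] ?? [];
        if ($mode === 'dfs') {
            // Reverse so the first link on the page is popped first.
            $children = array_reverse($children);
        }

        foreach ($children as $next) {
            if (isset($seen[$next]) || count($seen) >= $maxQueued) {
                continue; // already queued, or queue size limit reached
            }
            $seen[$next] = true;
            $frontier[] = [$next, $depth + 1];
        }
    }

    return $visited;
}

echo implode(' ', crawlOrder($links, 'a', 'bfs', 2, 10)), PHP_EOL; // a b c d e
echo implode(' ', crawlOrder($links, 'a', 'dfs', 2, 10)), PHP_EOL; // a b d c e
echo implode(' ', crawlOrder($links, 'a', 'bfs', 1, 10)), PHP_EOL; // a b c
echo implode(' ', crawlOrder($links, 'a', 'bfs', 2, 2)), PHP_EOL;  // a b
```

The only difference between the two algorithms in this sketch is which end of the frontier the next URI is taken from. The two limits are applied when links are queued rather than when they are visited, so a capped crawl never stores more pending work than it is allowed to process.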