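simspider - A Web Crawler Engine
1. Introduction
simspider is a lightweight, cross-platform web crawler engine. It provides a set of C function interfaces for quickly building your own crawler applications, and it also ships an executable crawler program that demonstrates how to use those interfaces. simspider's only third-party dependency is libcurl.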
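simspider currently supports the following platforms:
* UNIX/Linux
* Windows
The simspider function interfaces are very easy to use. The main flow is:
* Create the crawler engine environment
* Configure the crawler engine environment
* Recursively crawl all pages starting from an entry URL
* Destroy the crawler engine environment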
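Many optional settings are available for customizing the crawler engine environment, including but not limited to:
* Set the size of the request queue
* Set the collection of file extensions you are interested in
* Allow or disallow URLs with an empty file extension
* Allow or disallow crawling outside the current website
* Set the maximum recursion depth
* Set the HTTPS certificate file name
* Set the interval between requests
* Set the maximum number of concurrent requests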
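To make the effect of a few of these options concrete, here is a minimal, self-contained sketch of how an engine might decide whether a discovered URL should be queued. Everything in it (`struct crawl_options`, `url_extension`, `should_enqueue`) is a hypothetical illustration and not simspider's actual API. For brevity it also ignores query strings and fragments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical option set mirroring a few of the options listed above.
   These names are illustrative only; they are NOT simspider's real API. */
struct crawl_options {
    const char *const *extensions; /* NULL-terminated list, e.g. {"html", "htm", NULL} */
    bool allow_empty_extension;    /* accept URLs whose last path segment has no extension */
    bool allow_leaving_site;       /* accept URLs that do not start with site_prefix */
    const char *site_prefix;       /* e.g. "http://localhost/" */
    int max_depth;                 /* 0 means "no limit" in this sketch */
};

/* Return the extension of the last path segment, or NULL if there is none.
   Query strings and fragments are not handled in this sketch. */
static const char *url_extension(const char *url)
{
    const char *scheme = strstr(url, "://");
    const char *path = strchr(scheme != NULL ? scheme + 3 : url, '/');
    if (path == NULL)
        return NULL;                          /* host only, no path */
    const char *slash = strrchr(path, '/');   /* never NULL: path starts with '/' */
    const char *dot = strrchr(slash, '.');
    return (dot != NULL && dot[1] != '\0') ? dot + 1 : NULL;
}

/* Decide whether a URL found at the given recursion depth should be queued. */
bool should_enqueue(const struct crawl_options *opt, const char *url, int depth)
{
    if (opt->max_depth > 0 && depth > opt->max_depth)
        return false;
    if (!opt->allow_leaving_site &&
        strncmp(url, opt->site_prefix, strlen(opt->site_prefix)) != 0)
        return false;

    const char *ext = url_extension(url);
    if (ext == NULL)
        return opt->allow_empty_extension;
    for (size_t i = 0; opt->extensions[i] != NULL; i++) {
        if (strcmp(ext, opt->extensions[i]) == 0)
            return true;
    }
    return false;
}
```

With `extensions = {"html", "htm", NULL}`, `allow_empty_extension = true`, `allow_leaving_site = false`, `site_prefix = "http://localhost/"` and `max_depth = 3`, this sketch accepts `http://localhost/docs/index.html` at depth 2 and rejects `http://localhost/logo.png`, any URL on another host, and anything at depth 4.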
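The simspider engine implements a flexible processing framework and exposes a rich set of callback function pointers, so that application designers can add their own logic at any point in the crawl, including but not limited to:
* When building the HTTP request headers
* When building the HTTP request body (usually POST content)
* When the HTTP response headers are received
* When the HTTP response body (usually HTML) is received
(Inside these four callbacks, the application can use another group of simspider functions to get the parent URL, the current URL, the response code, the recursion depth, the CURL object, the HTTP buffers, and so on.)
* When reviewing the done queue after the crawl completes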
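The callback mechanism described above follows the usual C pattern of registering function pointers that the engine invokes at fixed stages. The sketch below shows that pattern in isolation. All names in it (`struct page_ctx`, `struct hooks`, `process_page`, `count_links`) are hypothetical and chosen for illustration; simspider's real registration functions and accessors are not shown here and may look different.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-page context. simspider exposes comparable information
   through its own accessor functions; these field names are assumptions. */
struct page_ctx {
    const char *referer;   /* parent URL */
    const char *url;       /* current URL */
    long status;           /* HTTP response code */
    int depth;             /* recursion depth */
    const char *body;      /* NUL-terminated response body */
    void *user_data;       /* application-owned state */
};

/* A hook returns 0 to continue, non-zero to abort processing of this page. */
typedef int (*page_hook)(struct page_ctx *ctx);

/* One optional hook per stage; NULL means "not interested". */
struct hooks {
    page_hook on_request_header;
    page_hook on_request_body;
    page_hook on_response_header;
    page_hook on_response_body;
};

/* Call the hooks in stage order and stop at the first non-zero result. */
int process_page(const struct hooks *h, struct page_ctx *ctx)
{
    page_hook stages[] = {
        h->on_request_header, h->on_request_body,
        h->on_response_header, h->on_response_body
    };
    for (size_t i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        if (stages[i] != NULL) {
            int rc = stages[i](ctx);
            if (rc != 0)
                return rc;
        }
    }
    return 0;
}

/* Example response-body hook: count "href=" occurrences in the page. */
int count_links(struct page_ctx *ctx)
{
    int *count = ctx->user_data;
    const char *p = ctx->body;
    while ((p = strstr(p, "href=")) != NULL) {
        (*count)++;
        p += 5; /* skip past the "href=" just matched */
    }
    return 0;
}
```

An application would fill in only the hooks it cares about, for example `struct hooks h = { NULL, NULL, NULL, count_links };`, and leave the rest as NULL. Passing state through `user_data` keeps the hooks free of global variables.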
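2. My first crawler program
Building a crawler application on top of the simspider engine library is quite easy. Here is a simple example:
[code]
#include <stdio.h>
#include "libsimspider.h" /* simspider's public header; the file name is assumed here, check your installation */

int main()
{
	struct SimSpiderEnv	*penv = NULL ;
	int			nret = 0 ;

	/* Create the crawler engine environment */
	nret = InitSimSpiderEnv( & penv , NULL ) ;
	if( nret )
	{
		printf( "InitSimSpiderEnv failed[%d]\n" , nret );
		return 1;
	}

	/* Recursively crawl every page reachable from the entry URL */
	nret = SimSpiderGo( penv , "" , "http://localhost/" ) ;
	if( nret )
	{
		printf( "SimSpiderGo failed[%d]\n" , nret );
		CleanSimSpiderEnv( & penv );
		return 1;
	}

	/* Destroy the crawler engine environment */
	CleanSimSpiderEnv( & penv );

	return 0;
}
[/code]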
…
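6. Running the bundled crawler
The package ships with a crawler, src/simspider.c. A sample run is shown below. It was run in a Red Hat Enterprise Linux Server release 5.4 guest under VMware on a home desktop PC, crawling the curl manual pages served by Apache on the Windows XP machine outside the VM.
[code]
$ time ./simspider 192.168.6.79 5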
[http://192.168.6.79/]
[http://192.168.6.79/curl-config.html]
[http://192.168.6.79/TheArtOfHttpScripting]
[http://192.168.6.79/libcurl/index.html]
[http://192.168.6.79/index.html]
[http://192.168.6.79/libcurl/libcurl.html]
[http://192.168.6.79/libcurl/libcurl-easy.html]
[http://192.168.6.79/libcurl/libcurl-multi.html]
[http://192.168.6.79/libcurl/libcurl-share.html]
[http://192.168.6.79/libcurl/libcurl-errors.html]
[http://192.168.6.79/curl.html]
[http://192.168.6.79/libcurl/curl_easy_cleanup.html]
[http://192.168.6.79/libcurl/curl_easy_duphandle.html]
[http://192.168.6.79/libcurl/curl_easy_escape.html]
[http://192.168.6.79/libcurl/curl_easy_getinfo.html]
[http://192.168.6.79/libcurl/curl_easy_init.html]
[http://192.168.6.79/libcurl/curl_easy_pause.html]
[http://192.168.6.79/libcurl/curl_easy_perform.html]
[http://192.168.6.79/libcurl/curl_easy_recv.html]
[http://192.168.6.79/libcurl/curl_easy_reset.html]
[http://192.168.6.79/libcurl/curl_easy_strerror.html]
[http://192.168.6.79/libcurl/curl_easy_unescape.html]
[http://192.168.6.79/libcurl/curl_escape.html]
[http://192.168.6.79/libcurl/curl_formadd.html]
[http://192.168.6.79/libcurl/curl_formfree.html]
[http://192.168.6.79/libcurl/curl_formget.html]
[http://192.168.6.79/libcurl/curl_free.html]
[http://192.168.6.79/libcurl/curl_getenv.html]
[http://192.168.6.79/libcurl/curl_easy_send.html]
[http://192.168.6.79/libcurl/curl_global_cleanup.html]
[http://192.168.6.79/libcurl/curl_global_init.html]
[http://192.168.6.79/libcurl/curl_global_init_mem.html]
[http://192.168.6.79/libcurl/curl_mprintf.html]
[http://192.168.6.79/libcurl/curl_multi_add_handle.html]
[http://192.168.6.79/libcurl/curl_multi_assign.html]
[http://192.168.6.79/libcurl/curl_multi_cleanup.html]
[http://192.168.6.79/libcurl/curl_multi_fdset.html]
[http://192.168.6.79/libcurl/curl_getdate.html]
[http://192.168.6.79/libcurl/curl_multi_info_read.html]
[http://192.168.6.79/libcurl/curl_multi_init.html]
[http://192.168.6.79/libcurl/curl_multi_perform.html]
[http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
[http://192.168.6.79/libcurl/curl_multi_setopt.html]
[http://192.168.6.79/libcurl/curl_multi_socket.html]
[http://192.168.6.79/libcurl/curl_multi_socket_action.html]
[http://192.168.6.79/libcurl/curl_multi_strerror.html]
[http://192.168.6.79/libcurl/curl_multi_timeout.html]
[http://192.168.6.79/libcurl/curl_share_cleanup.html]
[http://192.168.6.79/libcurl/curl_share_init.html]
[http://192.168.6.79/libcurl/curl_share_setopt.html]
[http://192.168.6.79/libcurl/curl_share_strerror.html]
[http://192.168.6.79/libcurl/curl_slist_append.html]
[http://192.168.6.79/libcurl/curl_slist_free_all.html]
[http://192.168.6.79/libcurl/curl_strequal.html]
[http://192.168.6.79/libcurl/curl_unescape.html]
[http://192.168.6.79/libcurl/curl_version.html]
[http://192.168.6.79/libcurl/curl_version_info.html]
[http://192.168.6.79/libcurl/]
[http://192.168.6.79/libcurl/libcurl-tutorial.html]
[http://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200] [ 1] [] [http://192.168.6.79/]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/TheArtOfHttpScripting]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl-config.html]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/curl.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/index.html]
[ 200] [ 4] [http://192.168.6.79/libcurl/curl_easy_getinfo.html] [http://192.168.6.79/libcurl/]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_duphandle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_escape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_getinfo.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_pause.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_perform.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_recv.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_reset.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_send.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_easy_unescape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_escape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formadd.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formfree.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_formget.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_free.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getdate.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_getenv.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_global_init_mem.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_mprintf.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_add_handle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_assign.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_fdset.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_info_read.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_perform.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_remove_handle.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_socket_action.html]
[ 404] [ 4] [http://192.168.6.79/libcurl/curl_multi_socket.html] [http://192.168.6.79/libcurl/curl_multi_socket_all.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_multi_timeout.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_cleanup.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_init.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_setopt.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_share_strerror.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_append.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_slist_free_all.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_strequal.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_unescape.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/curl_version_info.html]
[ 200] [ 2] [http://192.168.6.79/] [http://192.168.6.79/libcurl/index.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-easy.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-errors.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-multi.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-share.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl-tutorial.html]
[ 200] [ 3] [http://192.168.6.79/libcurl/index.html] [http://192.168.6.79/libcurl/libcurl.html]

real 0m0.452s
user 0m0.062s
sys 0m0.360s
[/code]
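The four-field records in the second half of the output appear to be, in order, the response code, the recursion depth, the parent URL and the crawled URL. That reading is inferred from the output itself and is not a documented format. Under that assumption, a small helper such as the hypothetical `parse_done_record` below could be used to post-process a saved log, for example to list every non-200 page. It assumes URLs contain no `]` character.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One parsed record. The field meanings are inferred from the sample output. */
struct done_record {
    int status;         /* HTTP response code, e.g. 200 or 404 */
    int depth;          /* recursion depth */
    char referer[512];  /* parent URL, empty for the entry page */
    char url[512];      /* crawled URL */
};

/* Copy the text of the next "[...]" field into out.
   Returns a pointer just past the closing ']', or NULL on malformed input. */
static const char *read_bracketed(const char *p, char *out, size_t out_size)
{
    while (*p == ' ')
        p++;
    if (*p != '[')
        return NULL;
    p++;
    const char *end = strchr(p, ']');
    if (end == NULL || (size_t)(end - p) >= out_size)
        return NULL;
    memcpy(out, p, (size_t)(end - p));
    out[end - p] = '\0';
    return end + 1;
}

/* Parse "[ 200] [ 3] [referer] [url]". Returns 1 on success, 0 otherwise. */
int parse_done_record(const char *line, struct done_record *rec)
{
    char num[16];
    const char *p = read_bracketed(line, num, sizeof num);
    if (p == NULL || sscanf(num, "%d", &rec->status) != 1)
        return 0;
    p = read_bracketed(p, num, sizeof num);
    if (p == NULL || sscanf(num, "%d", &rec->depth) != 1)
        return 0;
    p = read_bracketed(p, rec->referer, sizeof rec->referer);
    if (p == NULL)
        return 0;
    p = read_bracketed(p, rec->url, sizeof rec->url);
    return p != NULL;
}
```

Applied to the sample run, the only record with a status other than 200 is the 404 for curl_multi_socket_all.html, reached at depth 4 from curl_multi_socket.html. Lines from the first half of the output, which hold a single bracketed URL, are rejected by the parser rather than misread.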
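7. Closing
Tempted yet? Then go download it and give it a try.
If you have any questions or suggestions, feel free to contact me ^_^
Project home page : http://git.oschina.net/calvinwilliams/simspider
Author email : calvinwilliams.c@gmail.com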