使用python和BeautifulSoup从网页检索链接


使用python和BeautifulSoup从网页检索链接


方法1

这是使用BeautifulSoup中的SoupStrainer类的简短片段:

import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')

for link in BeautifulSoup(response, parse_only=SoupStrainer('a')):
    if link.has_attr('href'):
        print link['href']

方法2

为了完整起见,BeautifulSoup 4版本也使用了服务器提供的编码:

from bs4 import BeautifulSoup
import urllib2

resp = urllib2.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().getparam('charset'))

for link in soup.find_all('a', href=True):
    print link['href']

或Python 3版本:

from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("http://www.gpsbasecamp.com/national-parks")
soup = BeautifulSoup(resp, from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

以及使用该requests库的版本,其编写将适用于Python 2和3:

from bs4 import BeautifulSoup
from bs4.dammit import EncodingDetector
import requests

resp = requests.get("http://www.gpsbasecamp.com/national-parks")
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
soup = BeautifulSoup(resp.content, from_encoding=encoding)

for link in soup.find_all('a', href=True):
    print(link['href'])