    import urllib2

    website = "WEBSITE"
    openwebsite = urllib2.urlopen(website)
    html = openwebsite.read()  # read from the handle opened above
    print html
So far so good.
But I only want the href links from the plain-text HTML. How can I solve this?
Try using BeautifulSoup:
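    from BeautifulSoup import BeautifulSoup
    import urllib2
    import re  # needed for the http:// filter further down

    html_page = urllib2.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page)
    # print the href attribute of every <a> tag (None if the tag has no href)
    for link in soup.findAll('a'):
        print link.get('href')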
If you only want links that start with http://, you should use:
    soup.findAll('a', attrs={'href': re.compile("^http://")})
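One thing to watch: the pattern above is anchored to `http://`, so it will not match `https://` links. Here is a minimal sketch of what each pattern accepts, using only the `re` module. The sample href values and the helper name `keep_matching` are made up for illustration and are not part of BeautifulSoup.

```python
import re

# The pattern from the answer, plus a variant that also accepts https://
HTTP_ONLY = re.compile(r"^http://")
HTTP_OR_HTTPS = re.compile(r"^https?://")

def keep_matching(hrefs, pattern):
    """Return the hrefs that match pattern at the start; skip None values."""
    return [h for h in hrefs if h is not None and pattern.match(h)]

# Hypothetical href values, roughly what link.get('href') might return
sample = ["http://example.com/a", "https://example.com/b", "/relative", "#top", None]

print(keep_matching(sample, HTTP_ONLY))
print(keep_matching(sample, HTTP_OR_HTTPS))
```

If you want both schemes, passing `re.compile("^https?://")` as the `href` filter should work the same way.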
In Python 3 with BS4, it should be:
    from bs4 import BeautifulSoup
    import urllib.request

    html_page = urllib.request.urlopen("http://www.yourwebsite.com")
    soup = BeautifulSoup(html_page, "html.parser")
    # find_all is the bs4 name; findAll is the older spelling of the same method
    for link in soup.find_all('a'):
        print(link.get('href'))
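If installing BeautifulSoup is not an option, the same idea can be sketched with the standard library's `html.parser` module instead. This is a different technique from the answer above, not a BeautifulSoup feature, and it is less forgiving of broken markup. The names `LinkCollector`, `extract_links`, and the sample HTML string are assumptions made up for this example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

def extract_links(html_text):
    """Return all href values found on <a> tags in html_text, in order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.links

# Hypothetical HTML snippet; in practice this would be the decoded page body
sample_html = (
    '<p><a href="http://example.com/">one</a> '
    '<a name="x">no href</a> '
    '<a href="/about">two</a></p>'
)

print(extract_links(sample_html))  # ['http://example.com/', '/about']
```

To use it on a real page you would pass in the decoded response text, for example `urllib.request.urlopen(url).read().decode("utf-8")`, assuming the page is UTF-8 encoded.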