Here is my code:
#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"

#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text, 'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div', id='content')

f = open('path/file_name.html', 'w')
f.write(str(div))
f.close()
While scraping these pages, I found that some non-ASCII characters end up in the HTML file written by this script. I need to either remove those characters or decode them into something readable. Any suggestions? Thanks.
A byte character is 8 bits (0-255), while an ASCII character is 7 bits (0-127), so you simply need to drop every character whose ord value is 128 or higher, i.e. keep only the characters with ord below 128.
chr converts an integer to a character, and ord converts a character to an integer.
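For example (a quick sanity check of the two built-ins; these values come straight from the ASCII table):

```python
# ord gives the code point of a character; chr does the reverse.
print(ord('A'))  # 65
print(chr(65))   # A
print(ord('A') < 128)  # True: 'A' is plain ASCII
```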
text = ''.join(c for c in str(div) if ord(c) < 128)
This should be your final code:
#!C:/Python27/python
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import urllib2
import sys
import urlparse
import io

url = "http://www.dlib.org/dlib/november14/beel/11beel.html"
#url = "http://eqa.unibo.it/article/view/4554"

#r = requests.get(url)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
#soup = BeautifulSoup(r.text, 'lxml')

if url.find("http://www.dlib.org") != -1:
    div = soup.find('td', valign='top')
else:
    div = soup.find('div', id='content')

f = open('path/file_name.html', 'w')
# keep only ASCII characters (ord below 128) before writing
text = ''.join(c for c in str(div) if ord(c) < 128)
f.write(text)
f.close()
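To see what the filter does on its own, here is a minimal, self-contained sketch using a made-up sample string (not text from the actual page):

```python
# Hypothetical sample containing accented (non-ASCII) characters.
sample = u'caf\xe9 r\xe9sum\xe9'

# Keep only characters whose code point is below 128 (plain ASCII).
clean = ''.join(c for c in sample if ord(c) < 128)
print(clean)  # caf rsum
```

Note that this silently drops the accented letters rather than transliterating them; if you would rather keep readable replacements, `sample.encode('ascii', 'ignore')` does the same dropping, and the `unicodedata.normalize` approach can decompose accents first so the base letters survive.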