I am trying to scrape a website, but it gives me an error.
I am using the following code:
    import urllib.request
    from bs4 import BeautifulSoup

    get = urllib.request.urlopen("https://www.website.com/")
    html = get.read()
    soup = BeautifulSoup(html)
    print(soup)
I get the following error:
      File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode characters in position 70924-70950: character maps to <undefined>
What can I do to fix this?
I fixed it by adding .encode("utf-8") to soup.
That means print(soup) becomes print(soup.encode("utf-8")).
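To see why the error appears and what .encode("utf-8") changes, here is a minimal sketch that runs without network access. The sample string and the helper names can_encode and printable are made up for illustration; cp1252 is the console encoding shown in the traceback. Note that in Python 3 the encode call produces bytes, so print shows the b'...' representation rather than readable text; replacing the unencodable characters is one possible alternative when the output only needs to be readable on the console.

```python
# Hypothetical sample text: U+2192 (rightwards arrow) has no cp1252 mapping.
sample = "next \u2192 page"


def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def printable(text, encoding):
    """Return text with characters the encoding cannot represent replaced by '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


print(can_encode(sample, "cp1252"))   # False: this is what print() trips over
print(can_encode(sample, "utf-8"))    # True: UTF-8 covers all of Unicode
print(printable(sample, "cp1252"))    # next ? page
print(sample.encode("utf-8"))         # b'next \xe2\x86\x92 page' (bytes, not text)
```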
I got the same UnicodeEncodeError when saving scraped web content to a file. To solve it, I replaced this code:
    with open(fname, "w") as f:
        f.write(html)
with this:
    import io

    with io.open(fname, "w", encoding="utf-8") as f:
        f.write(html)
Using io keeps the code backward compatible with Python 2.
If you only need to support Python 3, you can use the built-in open function instead:
    with open(fname, "w", encoding="utf-8") as f:
        f.write(html)
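As a rough sketch of the file-saving fix, the snippet below writes a string with an explicit UTF-8 encoding and reads it back. The helper save_html, the temporary path, and the sample markup are hypothetical and not part of the answer above. The bytes branch is an added assumption: if html comes straight from urlopen(...).read() as in the question, it is bytes in Python 3, and writing it to a text-mode file would raise a TypeError, so binary mode is used for that case.

```python
import os
import tempfile


def save_html(path, html):
    """Write html to path: str is encoded as UTF-8, bytes are written unchanged."""
    if isinstance(html, bytes):
        # Raw bytes (e.g. from urlopen().read()) need binary mode.
        with open(path, "wb") as f:
            f.write(html)
    else:
        # Explicit encoding avoids falling back to the platform default (e.g. cp1252).
        with open(path, "w", encoding="utf-8") as f:
            f.write(html)


# Hypothetical file name inside a fresh temporary directory.
fname = os.path.join(tempfile.mkdtemp(), "page.html")
page = "<p>next \u2192 page</p>"
save_html(fname, page)

# Read the file back with the same encoding to confirm the round trip.
with open(fname, encoding="utf-8") as f:
    restored = f.read()
print(restored == page)  # True
```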