I'm getting the following error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)
I'm not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I fix this? The .replace() string method doesn't work on it.
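The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell big-endian UTF-16 encoding from little-endian UTF-16 encoding. If you decode the web page with the right codec, Python will remove it for you. Example: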
    #!python2
    #coding: utf8
    u = u'ABC'
    e8 = u.encode('utf-8')        # encode without BOM
    e8s = u.encode('utf-8-sig')   # encode with BOM
    e16 = u.encode('utf-16')      # encode with BOM
    e16le = u.encode('utf-16le')  # encode without BOM
    e16be = u.encode('utf-16be')  # encode without BOM
    print 'utf-8     %r' % e8
    print 'utf-8-sig %r' % e8s
    print 'utf-16    %r' % e16
    print 'utf-16le  %r' % e16le
    print 'utf-16be  %r' % e16be
    print
    print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
    print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
    print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
    print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')
Note that EF BB BF is the BOM encoded in UTF-8. UTF-8 does not need it; it serves only as a signature (usually on Windows).
Output:
    utf-8     'ABC'
    utf-8-sig '\xef\xbb\xbfABC'
    utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
    utf-16le  'A\x00B\x00C\x00'
    utf-16be  '\x00A\x00B\x00C'

    utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
    utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
    utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
    utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.
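Note that the utf-16 codec needs the BOM to be present to decode reliably; without it, Python has no way to know whether the data is big-endian or little-endian.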
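Applied to the scraping case, the same idea might look like the sketch below. It is only an illustration, not code from the question: `decode_stripping_bom` and `strip_bom` are hypothetical helper names, the sample page bytes are made up, and the sketch assumes the raw bytes of the page are available before any decoding. It uses `b''` and `u''` literals so it should run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs


def decode_stripping_bom(raw, default='utf-8'):
    """Decode raw bytes, choosing a BOM-aware codec when a BOM is present.

    Hypothetical helper. It ignores UTF-32, whose little-endian BOM
    also starts with the UTF-16 little-endian BOM bytes.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # utf-8-sig drops the EF BB BF signature while decoding.
        return raw.decode('utf-8-sig')
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # utf-16 reads the BOM to pick the byte order, then drops it.
        return raw.decode('utf-16')
    # No BOM found: fall back to the assumed default encoding.
    return raw.decode(default)


def strip_bom(text):
    """Remove a leading U+FEFF from text that was already decoded."""
    return text[1:] if text.startswith(u'\ufeff') else text


# Made-up page content with a UTF-8 signature in front.
page = codecs.BOM_UTF8 + u'<html>caf\xe9</html>'.encode('utf-8')

assert decode_stripping_bom(page) == u'<html>caf\xe9</html>'
# If the page was already decoded with plain utf-8, strip the BOM afterwards.
assert strip_bom(page.decode('utf-8')) == u'<html>caf\xe9</html>'
```

If a library has already decoded the page for you, `strip_bom` is the relevant half; note that it operates on a unicode string, so the BOM has to be written as `u'\ufeff'` rather than as a byte-string literal.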