u'\ufeff' in a Python string

小编典典

python

I get the following error message:

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

I'm not sure what u'\ufeff' is; it shows up when I'm doing web scraping. How can I fix this? The string's .replace() method doesn't work on it.

2020-12-20

1 answer

小编典典

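The Unicode character U+FEFF is the byte order mark, or BOM, which is used to distinguish big-endian from little-endian UTF-16 encoding. If you decode the web page using the correct codec, Python will remove it for you. Example: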

#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')        # encode without BOM
e8s = u.encode('utf-8-sig')   # encode with BOM
e16 = u.encode('utf-16')      # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8     %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16    %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8     %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16    %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')

Note that EF BB BF is the UTF-8-encoded BOM. UTF-8 does not require it; it serves only as a signature (usually on Windows).

Output:

utf-8     'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16    '\xff\xfeA\x00B\x00C\x00'    # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8     u'\ufeffABC'    # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'          # removes BOM if present.
utf-16 w/ BOM decoded with utf-16    u'ABC'          # *requires* BOM to be present.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC'    # doesn't remove BOM if present.
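Applied to the scraping case in the question, the results above suggest decoding the downloaded bytes with utf-8-sig rather than utf-8. The following is a minimal sketch under assumptions: the `raw` bytes stand in for a scraped page that happens to be UTF-8 with a BOM, and `strip_bom` is a hypothetical helper for text that has already been decoded. It is written to run on both Python 2 and Python 3.

```python
import codecs

# Hypothetical payload: UTF-8 bytes with a leading BOM, as a scraped page might contain.
raw = codecs.BOM_UTF8 + u'ABC'.encode('utf-8')

# Plain utf-8 keeps the BOM as a leading U+FEFF character.
kept = raw.decode('utf-8')

# utf-8-sig drops the BOM if present and behaves like utf-8 if it is absent.
clean = raw.decode('utf-8-sig')


def strip_bom(text):
    """Remove one leading U+FEFF from already-decoded text (hypothetical helper)."""
    if text.startswith(u'\ufeff'):
        return text[1:]
    return text


print(repr(kept))
print(repr(clean))
print(repr(strip_bom(kept)))
```

Once the U+FEFF character is gone, the text in this example contains only ASCII characters, so encoding it with the ascii codec no longer raises UnicodeEncodeError. Real pages may of course contain other non-ASCII characters, in which case an explicit `.encode('utf-8')` is the safer choice.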

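Note that the utf-16 codec relies on the BOM to determine the byte order. Without one, Python cannot tell whether the data is big-endian or little-endian and assumes the platform's native byte order.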
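When the encoding of the incoming bytes is not known in advance, one option is to inspect the leading bytes for a BOM and pick the codec accordingly. This is only a sketch: `decode_with_bom` and its `default` parameter are hypothetical names, and it handles just the UTF-8 and UTF-16 BOMs. UTF-32 is deliberately left out, since its little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs

# BOM prefixes mapped to a codec that consumes the BOM while decoding.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def decode_with_bom(data, default='utf-8'):
    """Decode bytes using the codec implied by a leading BOM, else `default`."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return data.decode(codec)
    return data.decode(default)


print(repr(decode_with_bom(u'ABC'.encode('utf-8-sig'))))
print(repr(decode_with_bom(codecs.BOM_UTF16_BE + u'ABC'.encode('utf-16be'))))
```

For web pages, the charset declared in the HTTP headers or in the HTML itself is usually a better first source than BOM sniffing; this check is a fallback.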

2020-12-20