在Python中如何解析html


在Python中如何解析html


使用html.unescape()

import html
print(html.unescape('£682m'))

判断是Python2还是Python3

>>> try:
...     # Python 2.6-2.7
...     from HTMLParser import HTMLParser
... except ImportError:
...     # Python 3
...     from html.parser import HTMLParser
...
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m

您还可以使用six兼容性库来简化导入:

>>> from six.moves.html_parser import HTMLParser
>>> h = HTMLParser()
>>> print(h.unescape('£682m'))
£682m

使用BeautifulSoup

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>&pound;682m</p>")
<html><body><p>£682m</p></body></html>

w3lib.html库

In [202]: from w3lib.html import replace_entities

In [203]: replace_entities("&pound;682m")
Out[203]: u'\xa3682m'

In [204]: print replace_entities("&pound;682m")
£682m

格式化

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>