Here is my code:
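
    # coding:utf-8
    from urllib2 import urlopen
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

    if __name__ == '__main__':
        url = 'http://iccna.blog.sohu.com/164572951.html'
        data = urlopen(url).read()
        soup = BeautifulSoup(data, fromEncoding='gb18030')
        # WebExtractor is my own function, defined elsewhere
        print WebExtractor(soup)
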
But when I debug it, the data looks like this:
��5h�,��4�H�5��VM��\
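What should I do to get the correct data into BeautifulSoup? Thanks!
The problem is that the server returns gzip-compressed data, and urllib2 does not decompress it for you. Try this: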

    # -*- coding: utf-8 -*-
    from __future__ import print_function

    import gzip
    import StringIO
    import urllib2

    from BeautifulSoup import BeautifulSoup

    url = 'http://iccna.blog.sohu.com/164572951.html'
    response = urllib2.urlopen(url)
    data = response.read()

    # Wrap the raw bytes in a file-like object so GzipFile can decompress them
    data = StringIO.StringIO(data)
    gzipper = gzip.GzipFile(fileobj=data)
    html = gzipper.read()

    soup = BeautifulSoup(html, fromEncoding='gbk')
    print(soup)

On my system the Chinese characters still don't look right, but this should point you in the right direction.
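If it helps, here is a rough sketch of the same idea that decompresses only when the body really is gzip and then decodes the bytes explicitly. It is written for Python 3 (the code above is Python 2), the helper name `decode_body` and its `charset` parameter are made up for illustration, and defaulting to GB18030 is an assumption about the page: GB18030 is a superset of GBK, so it may cope with characters that `gbk` rejects, but I have not confirmed that this is what causes the remaining garbling.

```python
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def decode_body(raw, charset="gb18030"):
    """Return text from a response body that may or may not be gzipped.

    Hypothetical helper: the name and the default charset are assumptions,
    not part of urllib or BeautifulSoup.
    """
    if raw[:2] == GZIP_MAGIC:
        # Same GzipFile approach as above, applied only when needed
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()
    # errors="replace" substitutes U+FFFD for bytes invalid in the charset
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # Simulate a compressed GB18030 page instead of hitting the network
    sample = "你好, BeautifulSoup".encode("gb18030")
    print(decode_body(gzip.compress(sample)))
    print(decode_body(sample))
```

The decoded string can then be handed to BeautifulSoup directly, so the parser no longer has to guess the encoding.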