The answer here (Size of raw response in bytes) says:
Just take the len() of the content of the response:

>>> response = requests.get('https://github.com/')
>>> len(response.content)
51671
However, doing that does not give an accurate content length. For example, look at the following Python code:
import sys
import requests

def proccessUrl(url):
    try:
        r = requests.get(url)
        print("Correct Content Length: "+r.headers['Content-Length'])
        print("bytes of r.text : "+str(sys.getsizeof(r.text)))
        print("bytes of r.content : "+str(sys.getsizeof(r.content)))
        print("len r.text : "+str(len(r.text)))
        print("len r.content : "+str(len(r.content)))
    except Exception as e:
        print(str(e))

#this url contains a content-length header, we will use that to see if the content length we calculate is the same.
proccessUrl("https://stackoverflow.com")
If we try to calculate the content length manually and compare it to the value in the header, we get a much larger answer?
Correct Content Length: 51504
bytes of r.text : 515142
bytes of r.content : 257623
len r.text : 257552
len r.content : 257606
Why does len(r.content) not return the correct content length? And if the header is missing, how can we calculate it accurately ourselves?
The Content-Length header reflects the length of the body of the response. That is not the same thing as the length of the text or content attributes, because the response could be compressed. requests decompresses the response for you.
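As an aside: if all you need is for the header and len(r.content) to line up, one option (not part of the original answer, just a sketch) is to ask the server not to compress the body at all by sending Accept-Encoding: identity. Servers are free to ignore this, so check the response's Content-Encoding before relying on it:

import requests

# Sketch: request an uncompressed body so that Content-Length (if the
# server sends it) matches len(r.content). The server may still compress
# or omit the header, so verify Content-Encoding on the response.
r = requests.get('https://stackoverflow.com',
                 headers={'Accept-Encoding': 'identity'})
print(r.headers.get('Content-Encoding'))  # expect None or 'identity'
print(r.headers.get('Content-Length'))    # uncompressed size, if sent
print(len(r.content))                     # should match the header above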
You have to bypass a fair amount of internal plumbing to get at the original, compressed raw content, and then access some more internals if you want the response object to keep working normally. The "easiest" method is to enable streaming, then read from the raw socket:
from io import BytesIO

r = requests.get(url, stream=True)
# read directly from the raw urllib3 connection
raw_content = r.raw.read()
content_length = len(raw_content)
# replace the internal file-object to serve the data again
r.raw._fp = BytesIO(raw_content)
Demo:
>>> import requests
>>> from io import BytesIO
>>> url = "https://stackoverflow.com"
>>> r = requests.get(url, stream=True)
>>> r.headers['Content-Encoding']  # a compressed response
'gzip'
>>> r.headers['Content-Length']    # the raw response contains 52055 bytes of compressed data
'52055'
>>> r.headers['Content-Type']      # we are served UTF-8 HTML data
'text/html; charset=utf-8'
>>> raw_content = r.raw.read()
>>> len(raw_content)               # the raw content body length
52055
>>> r.raw._fp = BytesIO(raw_content)
>>> len(r.content)                 # the decompressed binary content, byte count
258719
>>> len(r.text)                    # the Unicode content decoded from UTF-8, character count
258658
This reads the full response into memory, so don't use this if you expect large responses! In that case, you can instead use shutil.copyfileobj() to copy the data from r.raw to a spooled temporary file (which switches to an on-disk file once a certain size is reached), take the size of that file, and then stuff that file onto r.raw._fp.
A function that adds a Content-Length header to any request that is missing that header would then look like this:
import requests
import shutil
import tempfile

def ensure_content_length(
    url, *args, method='GET', session=None, max_size=2**20,  # 1Mb
    **kwargs
):
    kwargs['stream'] = True
    session = session or requests.Session()
    r = session.request(method, url, *args, **kwargs)
    if 'Content-Length' not in r.headers:
        # stream content into a temporary file so we can get the real size
        spool = tempfile.SpooledTemporaryFile(max_size)
        shutil.copyfileobj(r.raw, spool)
        r.headers['Content-Length'] = str(spool.tell())
        spool.seek(0)
        # replace the original socket with our temporary file
        r.raw._fp.close()
        r.raw._fp = spool
    return r
This accepts an existing session, and lets you specify the request method as well. Adjust max_size as needed for your memory constraints. A demo on https://github.com, which lacks a Content-Length header:
>>> r = ensure_content_length('https://github.com/')
>>> r
<Response [200]>
>>> r.headers['Content-Length']
'14490'
>>> len(r.content)
54814
Note that if there is no Content-Encoding header present, or it is set to identity, and a Content-Length is available, then you can rely on Content-Length being the full size of the response. That is because in that case no compression has been applied.
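To make that rule concrete, here is a minimal sketch of the check; the helper name trusted_content_length is my own and not part of the original answer:

def trusted_content_length(response):
    """Return Content-Length as an int if it reflects the full,
    uncompressed body size, otherwise None.

    Assumption: the header is only trustworthy when the body was not
    compressed, i.e. Content-Encoding is absent or set to 'identity'.
    """
    encoding = response.headers.get('Content-Encoding', 'identity')
    length = response.headers.get('Content-Length')
    if encoding == 'identity' and length is not None:
        return int(length)
    return None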
As a side note: you should not use sys.getsizeof() if what you are after is the length of a bytes or str object (the number of bytes or characters in that object). sys.getsizeof() gives you the internal memory footprint of a Python object, which covers more than just the number of bytes or characters in that object.
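A quick illustration of the difference; the exact getsizeof() result varies by Python version and platform, so treat it as indicative only:

import sys

data = b'hello'
print(len(data))            # 5 -> the number of bytes in the object
print(sys.getsizeof(data))  # larger than 5: includes CPython object overhead,
                            # exact value depends on version and platform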