I have a raw HTTP string and I would like to represent its fields in an object. Is there any way to parse the individual headers from an HTTP string?
'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\nHost: www.google.com\r\nConnection: keep-alive\r\nAccept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5\r\nUser-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_6; en-US) AppleWebKit/534.13 (KHTML, like Gecko) Chrome/9.0.597.45 Safari/534.13\r\nAccept-Encoding: gzip,deflate,sdch\r\nAvail-Dictionary: GeNLY2f-\r\nAccept-Language: en-US,en;q=0.8\r\n [...]'
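Update: It's 2019, so I have rewritten this answer for Python 3, following confused comments from programmers trying to use the code. The original Python 2 code is now at the bottom of the answer.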
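There are excellent tools in the Standard Library both for parsing RFC 822 headers and for parsing entire HTTP requests. Here is an example request string that we can feed to the examples (note that Python treats it as one big string, even though we break it across several lines for readability):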
    request_text = (
        b'GET /who/ken/trust.html HTTP/1.1\r\n'
        b'Host: cm.bell-labs.com\r\n'
        b'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
        b'Accept: text/html;q=0.9,text/plain\r\n'
        b'\r\n'
    )
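As @TryPyPy points out, you can use Python's email library to parse the headers, though we should add that the resulting Message object acts like a dictionary of headers once you are done creating it: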
    from email.parser import BytesParser

    # Split off the request line, then parse only the header block
    request_line, headers_alone = request_text.split(b'\r\n', 1)
    headers = BytesParser().parsebytes(headers_alone)

    print(len(headers))     # -> "3"
    print(headers.keys())   # -> ['Host', 'Accept-Charset', 'Accept']
    print(headers['Host'])  # -> "cm.bell-labs.com"
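If you want a plain dictionary rather than a Message, a minimal sketch along these lines may help. The helper name `parse_headers_only` is my own invention for illustration, not part of the Standard Library; it relies only on `BytesParser.parsebytes()` and the case-insensitive lookup that Message objects provide.

```python
from email.parser import BytesParser

request_text = (
    b'GET /who/ken/trust.html HTTP/1.1\r\n'
    b'Host: cm.bell-labs.com\r\n'
    b'Accept: text/html;q=0.9,text/plain\r\n'
    b'\r\n'
)


def parse_headers_only(raw):
    """Hypothetical helper: return (request line, Message of headers)."""
    # Everything before the first CRLF is the request line
    request_line, _, headers_alone = raw.partition(b'\r\n')
    # headersonly=True tells the parser not to look for a message body
    message = BytesParser().parsebytes(headers_alone, headersonly=True)
    return request_line.decode('iso-8859-1'), message


request_line, headers = parse_headers_only(request_text)

# Lookups on the Message ignore case; a plain dict keeps the original names
header_dict = dict(headers.items())

print(request_line)
print(headers['HOST'])
print(header_dict)
```

Note that `dict()` keeps only the last value of a repeated header, so stay with the Message object (and its `get_all()` method) if duplicates matter to you.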
But this, of course, ignores the request line, or makes you parse it yourself. It turns out that there is a much better solution.
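The Standard Library will parse HTTP for you if you use its BaseHTTPRequestHandler. Its documentation is a bit obscure (a problem with the whole suite of HTTP and URL tools in the Standard Library), but all you have to do to make it parse a string is (a) wrap your string in a BytesIO(), (b) read the raw_requestline so that it stands ready to be parsed, and (c) capture any error codes that occur during parsing instead of letting it try to write them back to the client (since we don't have one!).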
So here is our specialization of the Standard Library class:
    from http.server import BaseHTTPRequestHandler
    from io import BytesIO

    class HTTPRequest(BaseHTTPRequestHandler):
        def __init__(self, request_text):
            self.rfile = BytesIO(request_text)
            self.raw_requestline = self.rfile.readline()
            self.error_code = self.error_message = None
            self.parse_request()

        # Match the base-class signature: parse_request() sometimes passes
        # a third "explain" argument (for example, on overlong header lines)
        def send_error(self, code, message=None, explain=None):
            self.error_code = code
            self.error_message = message
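Again, I wish the Standard Library folks had realized that HTTP parsing should be broken out in a way that did not require us to write nine lines of code to call it properly, but what can you do? Here is how you would use this simple class: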
    # Using this new class is really easy!

    request = HTTPRequest(request_text)

    print(request.error_code)       # None  (check this first)
    print(request.command)          # "GET"
    print(request.path)             # "/who/ken/trust.html"
    print(request.request_version)  # "HTTP/1.1"
    print(len(request.headers))     # 3
    print(request.headers.keys())   # ['Host', 'Accept-Charset', 'Accept']
    print(request.headers['host'])  # "cm.bell-labs.com"
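The request in the question also carries a query string, which `request.path` returns unparsed. As a rough sketch (my addition, not part of the original answer), you could hand the path to `urllib.parse` to split it up. The class is repeated here only so that the example runs on its own, and the request bytes are a shortened version of the one in the question.

```python
from http.server import BaseHTTPRequestHandler
from io import BytesIO
from urllib.parse import parse_qs, urlsplit


class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


# Shortened version of the request from the question
raw = (
    b'GET /search?sourceid=chrome&ie=UTF-8&q=ergterst HTTP/1.1\r\n'
    b'Host: www.google.com\r\n'
    b'Connection: keep-alive\r\n'
    b'\r\n'
)

request = HTTPRequest(raw)

# request.path still contains the query string, so split it off
parts = urlsplit(request.path)
query = parse_qs(parts.query)  # maps each name to a list of values

print(parts.path)  # /search
print(query)       # {'sourceid': ['chrome'], 'ie': ['UTF-8'], 'q': ['ergterst']}
```

`parse_qs()` returns a list for every key because a parameter may legally appear more than once.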
If there is an error during parsing, error_code will not be None:
    # Parsing can result in an error code and message

    request = HTTPRequest(b'GET\r\nHeader: Value\r\n\r\n')

    print(request.error_code)     # 400 (an HTTPStatus member; how it prints varies by Python version)
    print(request.error_message)  # "Bad request syntax ('GET')"
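Since the question starts from a str rather than bytes and asks for an object with fields, here is one possible wrapper, offered as a sketch rather than the definitive way to do it. `ParsedRequest` and `parse_request_text` are names I made up for illustration. I am assuming ISO-8859-1 for the encoding step because that is how the Standard Library decodes the request line, and I turn a parse error into a `ValueError` instead of asking the caller to check `error_code`.

```python
from dataclasses import dataclass, field
from http.server import BaseHTTPRequestHandler
from io import BytesIO


class HTTPRequest(BaseHTTPRequestHandler):
    def __init__(self, request_text):
        self.rfile = BytesIO(request_text)
        self.raw_requestline = self.rfile.readline()
        self.error_code = self.error_message = None
        self.parse_request()

    def send_error(self, code, message=None, explain=None):
        self.error_code = code
        self.error_message = message


@dataclass
class ParsedRequest:
    """Hypothetical container for the parsed fields."""
    method: str
    path: str
    version: str
    headers: dict = field(default_factory=dict)


def parse_request_text(text):
    """Hypothetical helper: parse a str request, raising on bad input."""
    # Assumption: the text holds only characters that fit in ISO-8859-1
    request = HTTPRequest(text.encode('iso-8859-1'))
    if request.error_code is not None:
        raise ValueError(f'{int(request.error_code)}: {request.error_message}')
    return ParsedRequest(
        method=request.command,
        path=request.path,
        version=request.request_version,
        headers=dict(request.headers.items()),  # repeated headers: last one wins
    )


good = parse_request_text(
    'GET /who/ken/trust.html HTTP/1.1\r\nHost: cm.bell-labs.com\r\n\r\n'
)
print(good.method, good.path, good.version)
print(good.headers)

try:
    parse_request_text('GET\r\nHeader: Value\r\n\r\n')
except ValueError as exc:
    error_text = str(exc)
    print(error_text)
```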
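I prefer using the Standard Library like this because I suspect its authors have already encountered and resolved the edge cases that might bite me if I tried to re-implement an Internet specification myself with regular expressions.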
Here is the original Python 2 code for this answer, from back when I first wrote it:
    request_text = (
        'GET /who/ken/trust.html HTTP/1.1\r\n'
        'Host: cm.bell-labs.com\r\n'
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3\r\n'
        'Accept: text/html;q=0.9,text/plain\r\n'
        '\r\n'
    )
And:
    # Ignore the request line and parse only the headers

    from mimetools import Message
    from StringIO import StringIO

    request_line, headers_alone = request_text.split('\r\n', 1)
    headers = Message(StringIO(headers_alone))

    print len(headers)     # -> "3"
    print headers.keys()   # -> ['accept-charset', 'host', 'accept']
    print headers['Host']  # -> "cm.bell-labs.com"

And:

    from BaseHTTPServer import BaseHTTPRequestHandler
    from StringIO import StringIO

    class HTTPRequest(BaseHTTPRequestHandler):
        def __init__(self, request_text):
            self.rfile = StringIO(request_text)
            self.raw_requestline = self.rfile.readline()
            self.error_code = self.error_message = None
            self.parse_request()

        def send_error(self, code, message):
            self.error_code = code
            self.error_message = message

And:

    # Using this new class is really easy!

    request = HTTPRequest(request_text)

    print request.error_code       # None  (check this first)
    print request.command          # "GET"
    print request.path             # "/who/ken/trust.html"
    print request.request_version  # "HTTP/1.1"
    print len(request.headers)     # 3
    print request.headers.keys()   # ['accept-charset', 'host', 'accept']
    print request.headers['host']  # "cm.bell-labs.com"

And:

    # Parsing can result in an error code and message

    request = HTTPRequest('GET\r\nHeader: Value\r\n\r\n')

    print request.error_code     # 400
    print request.error_message  # "Bad request syntax ('GET')"