I have a one-gigabyte JSON file. The file consists of JSON objects of no more than a few thousand characters each, but there are no line breaks between the records.
Using Python 3 and the json module, how can I read one JSON object at a time from the file into memory?
The data is in a plain text file. Here is an example of a similar record. The actual records contain many nested dictionaries and lists.
A record in readable format:

```json
{
    "results": {
        "__metadata": {
            "type": "DataServiceProviderDemo.Address"
        },
        "Street": "NE 228th",
        "City": "Sammamish",
        "State": "WA",
        "ZipCode": "98074",
        "Country": "USA"
    }
}
```
The actual format. New records start one right after another, without any break:
{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }{"results": { "__metadata": {"type": "DataServiceProviderDemo.Address"},"Street": "NE 228th","City": "Sammamish","State": "WA","ZipCode": "98074","Country": "USA" } } }
Generally speaking, putting more than one JSON object into a file makes that file invalid, broken JSON. That said, you can still parse the data using the JSONDecoder.raw_decode() method.
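As a minimal illustration (using made-up records), raw_decode() returns a pair of the decoded object and the index where parsing stopped, so data trailing the first document is tolerated rather than rejected:

```python
from json import JSONDecoder

decoder = JSONDecoder()
# raw_decode() decodes the first JSON document in the string and returns
# the object together with the offset just past it.
obj, index = decoder.raw_decode('{"City": "Sammamish"}{"City": "Redmond"}')
print(obj)    # {'City': 'Sammamish'}
print(index)  # 21 -- where the next record begins
```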
The following will yield whole objects as the parser finds them:
```python
from json import JSONDecoder
from functools import partial

def json_parse(fileobj, decoder=JSONDecoder(), buffersize=2048):
    buffer = ''
    for chunk in iter(partial(fileobj.read, buffersize), ''):
        buffer += chunk
        while buffer:
            try:
                result, index = decoder.raw_decode(buffer)
                yield result
                buffer = buffer[index:].lstrip()
            except ValueError:
                # Not enough data to decode, read more
                break
```
This function reads chunks from the given file object in buffersize-sized pieces and has the decoder object parse whole JSON objects out of the buffer. Each parsed object is yielded to the caller.
Use it like this:
```python
with open('yourfilename', 'r') as infh:
    for data in json_parse(infh):
        ...  # process object here
```
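As a quick sanity check, here is a sketch that feeds the parser three back-to-back records from an in-memory io.StringIO; the sample data and the deliberately tiny buffersize are only for illustration:

```python
import io

sample = '{"a": 1}{"b": 2}{"c": 3}'

# A small buffersize forces several reads, showing that an object split
# across chunk boundaries is still assembled and decoded correctly.
for obj in json_parse(io.StringIO(sample), buffersize=4):
    print(obj)
# {'a': 1}
# {'b': 2}
# {'c': 3}
```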
Only use this if your JSON objects are written to the file back-to-back, with no newlines in between. If you do have newlines, and each JSON object is confined to a single line, then you have a JSON Lines document, which can simply be read line by line, decoding each line with json.loads().
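In that case no streaming parser is needed; a minimal sketch, assuming a hypothetical file records.jsonl with exactly one object per line:

```python
import json

# 'records.jsonl' is a hypothetical file name; each line holds one JSON document.
with open('records.jsonl', 'r') as infh:
    for line in infh:
        data = json.loads(line)  # decode one object per line
        ...  # process the decoded object here
```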