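I have a very large JSON file (11 GB), too large to read into memory. I would like to break it into smaller files so that I can analyze the data. I am currently using Python and pandas for the analysis, and I am wondering whether there is some way to access the file in chunks that can be read into memory without crashing the program. Ideally, I would like to split several years of data into smaller, manageable files that each span roughly a week. The amount of data per period is not constant, and it does not matter whether the files cover fixed intervals.
Here is the data format: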
{ "actor" : { "classification" : [ "suggested" ], "displayName" : "myself", "followersCount" : 0, "followingCount" : 0, "followingStocksCount" : 0, "id" : "person:stocktwits:183087", "image" : "http://avatars.stocktwits.com/production/183087/thumb-1350332393.png", "link" : "http://stocktwits.com/myselfbtc", "links" : [ { "href" : null, "rel" : "me" } ], "objectType" : "person", "preferredUsername" : "myselfbtc", "statusesCount" : 2, "summary" : null, "tradingStrategy" : { "approach" : "Technical", "assetsFrequentlyTraded" : [ "Forex" ], "experience" : "Novice", "holdingPeriod" : "Day Trader" } }, "body" : "$BCOIN and macd is going down ..... http://stks.co/iDEB", "entities" : { "chart" : { "fullImage" : { "link" : "http://charts.stocktwits.com/production/original_10047145.png" }, "image" : { "link" : "http://charts.stocktwits.com/production/small_10047145.png" }, "link" : "http://stks.co/iDEB", "objectType" : "image" }, "sentiment" : { "basic" : "Bearish" }, "stocks" : [ { "displayName" : "Bitcoin", "exchange" : "PRIVATE", "industry" : null, "sector" : null, "stocktwits_id" : 9659, "symbol" : "BCOIN" } ], "video" : null }, "gnip" : { "language" : { "value" : "en" } }, "id" : "tag:gnip.stocktwits.com:2012:note/10047145", "inReplyTo" : { "id" : "tag:gnip.stocktwits.com:2012:note/10046953", "objectType" : "comment" }, "link" : "http://stocktwits.com/myselfbtc/message/10047145", "object" : { "id" : "note:stocktwits:10047145", "link" : "http://stocktwits.com/myselfbtc/message/10047145", "objectType" : "note", "postedTime" : "2012-10-17T19:13:50Z", "summary" : "$BCOIN and macd is going down ..... http://stks.co/iDEB", "updatedTime" : "2012-10-17T19:13:50Z" }, "provider" : { "displayName" : "StockTwits", "link" : "http://stocktwits.com" }, "verb" : "post" }
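jq 1.5 has a streaming parser (documented at http://stedolan.github.io/jq/manual/#Streaming). In one sense it is easy to use: for example, if your 1G file is named 1G.json, then the following command will produce a stream of lines, with one line per "leaf" value: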
jq -c --stream . 1G.json
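(The output is shown below. Note that each line is itself valid JSON.)
However, using the streamed output may not be so easy; it depends on what you want to do :-)
The key to understanding the streamed output is that most lines have the following form: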
[ PATH, VALUE ]
where "PATH" is the array representation of a path. (When using jq, this array can in fact be used as a path.)
[["actor","classification",0],"suggested"]
[["actor","classification",0]]
[["actor","displayName"],"myself"]
[["actor","followersCount"],0]
[["actor","followingCount"],0]
[["actor","followingStocksCount"],0]
[["actor","id"],"person:stocktwits:183087"]
[["actor","image"],"http://avatars.stocktwits.com/production/183087/thumb-1350332393.png"]
[["actor","link"],"http://stocktwits.com/myselfbtc"]
[["actor","links",0,"href"],null]
[["actor","links",0,"rel"],"me"]
[["actor","links",0,"rel"]]
[["actor","links",0]]
[["actor","objectType"],"person"]
[["actor","preferredUsername"],"myselfbtc"]
[["actor","statusesCount"],2]
[["actor","summary"],null]
[["actor","tradingStrategy","approach"],"Technical"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0],"Forex"]
[["actor","tradingStrategy","assetsFrequentlyTraded",0]]
[["actor","tradingStrategy","experience"],"Novice"]
[["actor","tradingStrategy","holdingPeriod"],"Day Trader"]
[["actor","tradingStrategy","holdingPeriod"]]
[["actor","tradingStrategy"]]
[["body"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["entities","chart","fullImage","link"],"http://charts.stocktwits.com/production/original_10047145.png"]
[["entities","chart","fullImage","link"]]
[["entities","chart","image","link"],"http://charts.stocktwits.com/production/small_10047145.png"]
[["entities","chart","image","link"]]
[["entities","chart","link"],"http://stks.co/iDEB"]
[["entities","chart","objectType"],"image"]
[["entities","chart","objectType"]]
[["entities","sentiment","basic"],"Bearish"]
[["entities","sentiment","basic"]]
[["entities","stocks",0,"displayName"],"Bitcoin"]
[["entities","stocks",0,"exchange"],"PRIVATE"]
[["entities","stocks",0,"industry"],null]
[["entities","stocks",0,"sector"],null]
[["entities","stocks",0,"stocktwits_id"],9659]
[["entities","stocks",0,"symbol"],"BCOIN"]
[["entities","stocks",0,"symbol"]]
[["entities","stocks",0]]
[["entities","video"],null]
[["entities","video"]]
[["gnip","language","value"],"en"]
[["gnip","language","value"]]
[["gnip","language"]]
[["id"],"tag:gnip.stocktwits.com:2012:note/10047145"]
[["inReplyTo","id"],"tag:gnip.stocktwits.com:2012:note/10046953"]
[["inReplyTo","objectType"],"comment"]
[["inReplyTo","objectType"]]
[["link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","id"],"note:stocktwits:10047145"]
[["object","link"],"http://stocktwits.com/myselfbtc/message/10047145"]
[["object","objectType"],"note"]
[["object","postedTime"],"2012-10-17T19:13:50Z"]
[["object","summary"],"$BCOIN and macd is going down ..... http://stks.co/iDEB"]
[["object","updatedTime"],"2012-10-17T19:13:50Z"]
[["object","updatedTime"]]
[["provider","displayName"],"StockTwits"]
[["provider","link"],"http://stocktwits.com"]
[["provider","link"]]
[["verb"],"post"]
[["verb"]]
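Since the analysis is being done in Python, here is a rough sketch of how the streamed lines could be consumed there. It is only an illustration under some assumptions: the big file is a sequence of top-level objects shaped like the sample (so each one fits in memory on its own), the timestamp lives in `object.postedTime` in the format shown, and grouping by ISO week is an acceptable reading of "roughly a week". The helper names (`set_path`, `iter_objects`, `week_key`, `split_by_week`) are made up for this example. It relies on the fact that a one-element line such as `[["verb"]]`, whose path has length 1, marks the end of a top-level value.

```python
import json
import os
from datetime import datetime


def set_path(root, path, value):
    """Store value at path inside root, creating dicts/lists as needed."""
    node = root
    for key, next_key in zip(path, path[1:]):
        # An integer key means the child container is a list, otherwise a dict.
        empty_child = [] if isinstance(next_key, int) else {}
        if isinstance(key, int):
            while len(node) <= key:
                node.append(None)
            if node[key] is None:
                node[key] = empty_child
            node = node[key]
        else:
            node = node.setdefault(key, empty_child)
    last = path[-1]
    if isinstance(last, int):
        while len(node) <= last:
            node.append(None)
    node[last] = value


def iter_objects(lines):
    """Rebuild top-level values, one at a time, from `jq -c --stream` lines."""
    holder = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        path = event[0]
        if len(event) == 2:
            if not path:
                # Top-level scalar or empty container: already complete.
                yield event[1]
            else:
                set_path(holder, ["top"] + path, event[1])
        elif len(path) == 1:
            # Closing event for the top-level value.
            yield holder.pop("top")


def week_key(obj):
    """Return an ISO year-week label (assumed field: object.postedTime)."""
    posted = datetime.strptime(obj["object"]["postedTime"], "%Y-%m-%dT%H:%M:%SZ")
    iso = posted.isocalendar()
    return f"{iso[0]}-W{iso[1]:02d}"


def split_by_week(lines, out_dir):
    """Append each rebuilt object, one JSON document per line, to a weekly file."""
    os.makedirs(out_dir, exist_ok=True)
    for obj in iter_objects(lines):
        out_path = os.path.join(out_dir, week_key(obj) + ".json")
        with open(out_path, "a", encoding="utf-8") as out:
            out.write(json.dumps(obj) + "\n")
```

One way to use it would be to pipe the jq output into a small script that calls `split_by_week(sys.stdin, "weeks")`. The resulting weekly files hold one JSON object per line, so each can be loaded separately. If the 11 GB file is instead one giant array, `iter_objects` would try to rebuild the whole array at once, so this sketch would not help as written.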