I need to create a pivot table of 2000 columns by around 30-50 million rows from a dataset of roughly 60 million rows. I've tried pivoting in chunks of 100,000 rows, and that works, but when I try to recombine the DataFrames by doing a .append() followed by .groupby('someKey').sum(), all my memory is taken up and python eventually crashes.
How do I pivot data this large with a limited amount of RAM?
EDIT: adding sample code
The code below includes various test outputs along the way, but the last print is what we're really interested in. Note that if we change segMax to 3, instead of 4, the code will produce a false positive for correct output. The main issue is that if a shipmentid entry is not in each and every chunk that sum(wawa) looks at, it does not show up in the output.
import pandas as pd
import numpy as np
import random
from pandas.io.pytables import *
import os

pd.set_option('io.hdf.default_format','table')

# create a small dataframe to simulate the real data.
def loadFrame():
    frame = pd.DataFrame()
    frame['shipmentid']=[1,2,3,1,2,3,1,2,3] #evenly distributing shipmentid values for testing purposes
    frame['qty']= np.random.randint(1,5,9) #random quantity is ok for this test
    frame['catid'] = np.random.randint(1,5,9) #random category is ok for this test
    return frame

def pivotSegment(segmentNumber,passedFrame):
    segmentSize = 3 #take 3 rows at a time
    frame = passedFrame[(segmentNumber*segmentSize):(segmentNumber*segmentSize + segmentSize)] #slice the input DF

    # ensure that all chunks are identically formatted after the pivot by appending a dummy DF with all possible category values
    span = pd.DataFrame()
    span['catid'] = range(1,5+1)
    span['shipmentid']=1
    span['qty']=0

    frame = frame.append(span)

    return frame.pivot_table(['qty'],index=['shipmentid'],columns='catid', \
                             aggfunc='sum',fill_value=0).reset_index()

def createStore():
    store = pd.HDFStore('testdata.h5')
    return store

segMin = 0
segMax = 4

store = createStore()
frame = loadFrame()

print('Printing Frame')
print(frame)
print(frame.info())

for i in range(segMin,segMax):
    segment = pivotSegment(i,frame)
    store.append('data',frame[(i*3):(i*3 + 3)])
    store.append('pivotedData',segment)

print('\nPrinting Store')
print(store)
print('\nPrinting Store: data')
print(store['data'])
print('\nPrinting Store: pivotedData')
print(store['pivotedData'])

print('**************')
print(store['pivotedData'].set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('**************')
print('$$$')
for df in store.select('pivotedData',chunksize=3):
    print(df.set_index('shipmentid').groupby('shipmentid',level=0).sum())
print('$$$')

store['pivotedAndSummed'] = sum((df.set_index('shipmentid').groupby('shipmentid',level=0).sum() for df in store.select('pivotedData',chunksize=3)))
print('\nPrinting Store: pivotedAndSummed')
print(store['pivotedAndSummed'])

store.close()
os.remove('testdata.h5')
print('closed')
You could do the appending with HDF5/pytables. This keeps it out of RAM.
Use the table format:
store = pd.HDFStore('store.h5')
for ...:
    ...
    chunk  # the chunk of the DataFrame (which you want to append)
    store.append('df', chunk)
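To make that concrete, here's a minimal sketch of filling the store from a large CSV; the file name, chunk size, and the data_columns choice are illustrative assumptions, not from the original answer:

import pandas as pd

store = pd.HDFStore('store.h5')

# stream the source file in manageable pieces rather than loading it all at once
for chunk in pd.read_csv('input.csv', chunksize=100000):
    # data_columns makes these columns queryable later via store.select
    store.append('df', chunk, data_columns=['shipmentid'])

store.close()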
Now you can read it in as a DataFrame in one go (assuming this DataFrame can fit in memory!):
df = store['df']
You can also query, to get only subsections of the DataFrame.
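For example, if shipmentid was declared in data_columns when appending (as in the sketch above), a query on a table-format store might look like this; the condition is illustrative:

# read only the matching rows, without loading the whole table
df_sub = store.select('df', where='shipmentid == 1')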
Aside: You should also buy more RAM, it's cheap.
Edit: You can groupby/sum from the store iteratively, since this "map-reduces" over the chunks:
# note: this doesn't work, see below
sum(df.groupby().sum() for df in store.select('df', chunksize=50000))

# equivalent to (but doesn't read in the entire frame)
store['df'].groupby().sum()
Edit2: Using sum as above doesn't actually work in pandas 0.16 (I thought it did in 0.15.2); instead you can use reduce with add:
reduce(lambda x, y: x.add(y, fill_value=0),
       (df.groupby().sum() for df in store.select('df', chunksize=50000)))
In Python 3, you must import reduce from functools.
Perhaps it's more pythonic/readable to write it as:
chunks = (df.groupby().sum() for df in store.select('df', chunksize=50000))
res = next(chunks)  # will raise if there are no chunks!
for c in chunks:
    res = res.add(c, fill_value=0)
If performance is poor, or if there are lots of new groups, then it may be preferable to start res as zeros of the correct size (by getting the unique group keys, e.g. by looping through the chunks), then add in place; see the sketch below.
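A rough two-pass sketch of that idea; the structure and names here are illustrative assumptions rather than code from the answer, and it assumes a single shipmentid group key with numeric value columns:

import pandas as pd

# first pass over the chunks: collect every unique group key
keys = set()
for df in store.select('df', chunksize=50000):
    keys.update(df['shipmentid'].unique())

# peek at one row to learn the value columns, then pre-size the result with zeros
first = store.select('df', start=0, stop=1)
value_cols = [c for c in first.columns if c != 'shipmentid']
res = pd.DataFrame(0.0, index=sorted(keys), columns=value_cols)

# second pass: add each chunk's group sums into the pre-sized result in place
for df in store.select('df', chunksize=50000):
    summed = df.groupby('shipmentid').sum()
    res.loc[summed.index, summed.columns] += summed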