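I am trying to get 1. the parent attributes, 2. the child attributes, and 3. the grandchild text into a DataFrame. I can print the child attribute and the grandchild text to the screen, but I can't get them into a DataFrame. I get a memory error from pandas.
Here is the setup: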
    import requests
    from lxml import etree, objectify

    r = requests.get('https://api.stuff.us/place/getData?security_key=key&period=minutes&startTime=2013-05-01T00:00&endTime=2013-05-01T23:59&sort=channel')  # edited for privacy
    root = etree.fromstring(r.text)
    xml_new = etree.tostring(root, pretty_print=True)
    print(xml_new[300:900])  # gives xml output to show structure

    <startTime>2013-05-01 00:00:00</startTime>
    <endTime>2013-05-01 23:59:00</endTime>
    <summaryPeriod>minutes</summaryPeriod>
    <data>
      <channel channel="97925" name="blah">
        <Time Time="2013-05-01 00:00:00">
          <value>258</value>
        </Time>
        <Time Time="2013-05-01 00:01:00">
          <value>259</value>
        </Time>
        <Time Time="2013-05-01 00:02:00">
          <value>258</value>
        </Time>
        <Time Time="2013-05-01 00:03:00">
          <value>257</value>
        </Time>
This shows how I parse the tree to get the child attribute and the grandchild text for printing.
    for df in root.xpath('//channel/Time'):
        # Iterate over attributes of channel/Time
        for attrib in df.attrib:
            print('@' + attrib + '=' + df.attrib[attrib])

        # value is a child of Time; iterate over the children
        subfields = df.getchildren()
        for subfield in subfields:
            print('subfield=' + subfield.text)
As requested, it gives a very long printout:
    ...
    @Time=2013-05-01 23:01:00
    value=100
    @Time=2013-05-01 23:02:00
    value=101
    @Time=2013-05-01 23:03:00
    value=99
    @Time=2013-05-01 23:04:00
    value=101
    ...
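However, when I try to put this into a DataFrame, I get a memory error. I tried it with both fields, and also with only the child attribute going into the DataFrame.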
    data = []
    for df in root.xpath('//channel/Time'):
        # Iterate over attributes of channel/Time
        for attrib in df.attrib:
            el_data = {}
            el_data[attrib] = df.attrib[attrib]
            data.append(el_data)

    from pandas import *
    perf = DataFrame(data)
    perf

    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    <ipython-input-6-08c8c74f7192> in <module>()
          1 from pandas import *
    ----> 2 perf = DataFrame(data)
          3 perf

    /Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
        417
        418             if isinstance(data[0], (list, tuple, collections.Mapping, Series)):
    --> 419                 arrays, columns = _to_arrays(data, columns, dtype=dtype)
        420                 columns = _ensure_index(columns)
        421

    /Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
       5457         return _list_of_dict_to_arrays(data, columns,
       5458                                        coerce_float=coerce_float,
    -> 5459                                        dtype=dtype)
       5460     elif isinstance(data[0], Series):
       5461         return _list_of_series_to_arrays(data, columns,

    /Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/core/frame.pyc in _list_of_dict_to_arrays(data, columns, coerce_float, dtype)
       5521                 for d in data]
       5522
    -> 5523     content = list(lib.dicts_to_array(data, list(columns)).T)
       5524     return _convert_object_array(content, columns, dtype=dtype,
       5525                                  coerce_float=coerce_float)

    /Users/user/Library/Enthought/Canopy_32bit/User/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.dicts_to_array (pandas/lib.c:7657)()

    MemoryError:
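Before handing a list like `data` to `DataFrame`, it can help to confirm that it holds what you expect: how many rows there are, which keys each dict carries, and what types the values have. The sketch below is only an illustration. The helper name `summarize_rows` and the two sample rows are made up, and it says nothing about what actually caused the error above.

```python
from collections import Counter


def summarize_rows(rows):
    """Hypothetical helper: describe the size and shape of a list of dicts."""
    # Count how often each distinct set of keys occurs.
    key_sets = Counter(tuple(sorted(row)) for row in rows)
    # Count the Python types of all values across all rows.
    value_types = Counter(type(v).__name__ for row in rows for v in row.values())
    return {
        "n_rows": len(rows),
        "key_sets": dict(key_sets),
        "value_types": dict(value_types),
    }


# Made-up sample shaped like the `data` list built in the question.
data = [{"Time": "2013-05-01 00:00:00"}, {"Time": "2013-05-01 00:01:00"}]
summary = summarize_rows(data)
print(summary)
```

If the row count, keys, or value types differ from what you expected, that is worth resolving before building the frame.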
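There are 12960 "value" entries in my XML file. I assume the memory error is telling me that something about the values in the file is not what pandas expects, but that doesn't fit a memory error, and I couldn't work it out from other questions about memory errors or from the pandas documentation.
Trying to get the data types yields no information. Maybe there are no types? Maybe it's because they are elements in an element tree. (I tried printing .pyval, but it only tells me there is no such attribute.) el_data is of type 'dict'.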
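For what it's worth, attribute values and element text come back from an ElementTree-style API as plain strings, so there is no numeric or datetime type to discover until you convert them yourself. Here is a minimal sketch of that conversion. It uses the standard-library `xml.etree.ElementTree` instead of lxml (an assumption, made so the snippet is self-contained) and a made-up two-row sample shaped like the XML above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Hypothetical sample mirroring the structure shown in the question.
SAMPLE = """
<data>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
    <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
  </channel>
</data>
"""

root = ET.fromstring(SAMPLE.strip())
first = root.find("./channel/Time")

raw_time = first.get("Time")         # attribute value: a str
raw_value = first.findtext("value")  # element text: also a str

# Explicit conversion gives real Python types.
parsed_time = datetime.strptime(raw_time, "%Y-%m-%d %H:%M:%S")
parsed_value = int(raw_value)

print(type(raw_time).__name__, type(raw_value).__name__)
print(parsed_time, parsed_value)
```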
    print(objectify.dump(root)[700:1000])  # print a subset of types

    name = 'zone'
    Time = None [_Element]
      * Time = '2013-05-01 00:00:00'
        value = '258' [_Element]
    Time = None [_Element]
      * Time = '2013-05-01 00:01:00'
        value = '259' [_Element]

    type(el_data)
    dict
I built this code from the book "Python for Data Analysis" and from other XML-parsing examples I found on SO. I'm still new to Python.
Running Python 2.7.2 on Mac OS 10.7.5.
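Answer, based on help from Jeff and JoeKington. The data needs to be collected into separate lists before it is pushed into the DataFrame. The memory error was caused by the multiple "elements" that could not be put into a DataFrame. Instead, the values from each element need to go into a list, and those lists can then go into the DataFrame.
This works: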
    from pandas import DataFrame

    dTime = []
    dvalue = []

    for df in root.xpath('//channel/Time'):
        # Iterate over attributes of channel/Time
        for attrib in df.attrib:
            dTime.append(df.attrib[attrib])

        # value is a child of Time; iterate over the children
        subfields = df.getchildren()
        for subfield in subfields:
            dvalue.append(subfield.text)

    pef = DataFrame({'Time': dTime, 'value': dvalue})

    pef

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 12960 entries, 0 to 12959
    Data columns (total 2 columns):
    Time     12960  non-null values
    value    12960  non-null values
    dtypes: object(2)

    pef[:5]

                      Time value
    0  2013-05-01 00:00:00   258
    1  2013-05-01 00:01:00   259
    2  2013-05-01 00:02:00   258
    3  2013-05-01 00:03:00   257
    4  2013-05-01 00:04:00   257
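The code above covers the child attribute and the grandchild text. The parent (channel) attributes from the original question could be carried along in the same pass by building one dict per Time element. This is a rough sketch rather than a tested drop-in. It uses the standard-library `xml.etree.ElementTree` in place of lxml so that it is self-contained, and the function name `time_rows`, the second channel, and its values are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the second channel is invented to show the parent
# attributes changing from row to row.
SAMPLE = """
<data>
  <channel channel="97925" name="blah">
    <Time Time="2013-05-01 00:00:00"><value>258</value></Time>
    <Time Time="2013-05-01 00:01:00"><value>259</value></Time>
  </channel>
  <channel channel="97926" name="other">
    <Time Time="2013-05-01 00:00:00"><value>100</value></Time>
  </channel>
</data>
"""


def time_rows(root):
    """Return one dict per <Time>, including its parent <channel> attributes."""
    rows = []
    for channel in root.iter("channel"):
        for time_el in channel.findall("Time"):
            rows.append({
                "channel": channel.get("channel"),   # parent attribute
                "name": channel.get("name"),         # parent attribute
                "Time": time_el.get("Time"),         # child attribute
                "value": time_el.findtext("value"),  # grandchild text
            })
    return rows


rows = time_rows(ET.fromstring(SAMPLE.strip()))
for row in rows:
    print(row)
```

A list of dicts with identical keys like `rows` can be passed to `pandas.DataFrame(rows)`, which should give one column per key. All values are still strings at that point.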