xml - Python - Generator function resets between calls?
I'm parsing a language dictionary, represented in an XML file, using ElementTree's iterparse function. I'm filtering it with a generator function, and some weird order of execution or a misunderstanding on my part is giving me a duplicate entry. Here's the setup code (this is happening inside a function, but the other details don't matter):

    import xml.etree.cElementTree as et

    # we can discard the original iterable, I think
    dictionary = iter(et.iterparse("../dictionaries/language_name.xml",
                                   events=("start", "end")))
Filtering
Then I have a function that receives the iterator and filters it (ignore the global variable, it's just for debugging this problem):

    def get_entries(iterparsed):
        global yielded
        root = next(iterparsed)[1]  # iterparse gives (event, element)
        yield root
        for event, elem in iterparsed:
            if event == "end" and elem.tag == "entry":
                yielded += 1
                print("num yielded:", yielded)
                print("yielding", et.tostring(elem, encoding="utf-8"))
                yield elem
Processing
Then I use it like this (again, the globals are temporary and only for debugging):

    root = next(get_entries(dictionary))
    for elem in get_entries(dictionary):
        global received
        received += 1
        print("num received:", received)
        print("I got", et.tostring(elem, encoding="utf-8"))
        raw_input("continue? ")  # I yield the first item once, but receive it twice? :(
        process_entry(elem)  # defined elsewhere, adds a <sgmtd> node to each entry
        root.clear()  # clears the processed children of the root node
Output
If I run through everything, I end up with yielded = 9050 while received = 9051. Here is the problematic output:
    num received: 1
    I got <entry><form>aː</form><ortho>a:</ortho><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
    continue?
    num yielded: 1
    yielding <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
    num received: 2
    I got <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
    continue?
    num yielded: 2
    yielding <entry><form>aːčáx</form><ortho>a:cháj</ortho><pos>n</pos><sense><def><en>axe</en><es>hacha</es></def></sense></entry>
    num received: 3
    I got <entry><form>aːčáx</form><ortho>a:cháj</ortho><pos>n</pos><sense><def><en>axe</en><es>hacha</es></def></sense></entry>
    continue?
The question
Now, I've checked, and elem isn't defined prior to the loop starting. And no, there aren't 2 identical elements at the start of the file. After the first "I received" bit, it seems to be working the way I expect - things are yielded and then received (e.g. a:cháj axe is yielded first, then received).
Even more oddly, the first element is processed before being yielded - without being cleared at the end of the loop. The first time it's "received", it has no <sgmtd> node. When it's "yielded" for the first time, it has a <sgmtd> node, indicating it's been processed. Then it's received again, and (despite a line saying if not elem.find("sgmtd"): elem.insert(2, segmented_form)) a second <sgmtd> node is added and written out to the file. The output file winds up with:

    <?xml version="1.0" encoding="utf-8"?>
    <lexicon>
    <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
    <entry><form>aː</form><ortho>a:</ortho><sgmtd /><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
So what am I misunderstanding here? How is an item "received" from a generator function without any of the code prior to the yield statement being executed?
It turns out that changing the if not elem.find("sgmtd") line to if elem.find("sgmtd") is None stops the duplicate item from being processed. I guess Element objects don't implicitly convert to True the way I expected. I'd still like to know why it showed up!
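As a rough illustration of why the is None test behaves differently, here is a minimal sketch. It assumes the plain xml.etree.ElementTree module and a made-up entry string. An element's truth value has traditionally followed its number of children, so an empty <sgmtd /> that was found can still look "false", and newer Python releases warn about truth-testing elements at all. The sketch therefore checks len() and is None explicitly rather than relying on truthiness.

```python
import xml.etree.ElementTree as et

# Hypothetical entry, only for illustration: it already contains an empty <sgmtd />
entry = et.fromstring("<entry><form>a</form><sgmtd /><pos>n</pos></entry>")

found = entry.find("sgmtd")     # the element exists, but it has no children
missing = entry.find("absent")  # no such child, so find() returns None

# "Was it found?" and "does it have children?" are separate questions
found_exists = found is not None
found_child_count = len(found)
missing_exists = missing is not None

print(found_exists, found_child_count)  # True 0
print(missing_exists)                   # False
```

Comparing against None asks only whether the node exists, which is the question the original if statement was meant to answer.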
Both @Chad Miller and @Jochen Ritzel pointed out that I wasn't counting the root element I was yielding. That was intentional - I thought what would happen is that the generator function would never reset, in the same way that generator objects don't. So when I started the loop for elem in get_entries(dictionary), I figured the root element had already been consumed.
However, if I add a print statement before yielding the root element, it gets printed twice. The duplication of data I was seeing was caused by elem.insert(2, segmented_form) being called on the root, as segmented_form involves using elem.find (thus searching children) and grabbing the first element of the tree.
So: the reason I was seeing duplicates is that generator functions don't behave the same way as generator objects. Lesson learned!
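To make the difference concrete, here is a small sketch using a plain list in place of the iterparse stream. The function get_items and the sample data are hypothetical stand-ins for get_entries and the dictionary file. Every call to a generator function builds a new generator object whose body starts from the top, so a second call takes whatever the shared iterator offers next and treats it as its "first" item.

```python
def get_items(shared):
    # Hypothetical stand-in for get_entries: this body restarts on every call
    first = next(shared)
    yield ("first", first)
    for item in shared:
        yield ("item", item)

# Two calls: two generator objects sharing one underlying iterator
source = iter(["root", "entry1", "entry2"])
head_two = next(get_items(source))
rest_two = list(get_items(source))
print(head_two)  # ('first', 'root')
print(rest_two)  # [('first', 'entry1'), ('item', 'entry2')]

# One call: a single generator object that remembers where it stopped
source = iter(["root", "entry1", "entry2"])
items = get_items(source)
head_one = next(items)
rest_one = list(items)
print(head_one)  # ('first', 'root')
print(rest_one)  # [('item', 'entry1'), ('item', 'entry2')]
```

Applied to the code in the question, this suggests calling get_entries(dictionary) once, keeping the resulting generator in a variable, taking the root from it with next(), and then looping over that same variable.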