xml - Python - Generator function resets between calls?


I'm parsing a language dictionary, represented in an XML file, with ElementTree's iterparse function. I'm filtering it with a generator function, and some weird order of execution or a misunderstanding on my part is giving me a duplicate entry. Here's the setup code (this is all happening inside a function, but the other details don't matter):

    import xml.etree.cElementTree as et

    dictionary = iter(et.iterparse("../dictionaries/language_name.xml",
                                   events=("start", "end")))
    # We can discard the original iterable, I think

Filtering

Then I have a function that receives the iterator and filters it (ignore the global variable, it's just for debugging this problem):

    def get_entries(iterparsed):
        global yielded
        root = next(iterparsed)[1]  # iterparse gives (event, element)
        yield root

        for event, elem in iterparsed:
            if event == "end" and elem.tag == "entry":
                yielded += 1
                print("num yielded:", yielded)
                print("yielding", et.tostring(elem, encoding="utf-8"))
                yield elem
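For reference, here is a minimal sketch of what iterparse hands to that function when both "start" and "end" events are requested. The tiny in-memory document is a hypothetical stand-in for the real dictionary file, and the sketch imports xml.etree.ElementTree because newer Python versions no longer ship cElementTree.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature document standing in for the real dictionary file.
xml_bytes = b"<lexicon><entry><form>a</form></entry></lexicon>"

# Collect (event, tag) pairs in the order iterparse produces them.
events = [(event, elem.tag)
          for event, elem in ET.iterparse(io.BytesIO(xml_bytes),
                                          events=("start", "end"))]

for event, tag in events:
    print(event, tag)
```

The first pair is the "start" event for the root element, which is what `next(iterparsed)[1]` relies on. The pair after it is the "start" event for the first entry, so a second `next()` on the same iterator returns that entry, not the root.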

Processing

Then I use it like this (again, the global is temporary, for debugging):

    root = next(get_entries(dictionary))
    for elem in get_entries(dictionary):
        global received
        received += 1
        print("num received:", received)
        print("I got", et.tostring(elem, encoding="utf-8"))
        raw_input("continue? ")

        # I yield the first item once, but receive it twice? :(
        process_entry(elem)  # defined elsewhere, adds a <sgmtd> node to each entry
        root.clear()  # clears the processed children of the root node

Output

If I run through everything, yielded = 9050 while received = 9051. And here's the problematic output:

    num received: 1
    I got <entry><form>aː</form><ortho>a:</ortho><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

    continue? 
    num yielded: 1
    yielding <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

    num received: 2
    I got <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

    continue? 
    num yielded: 2
    yielding <entry><form>aːčáx</form><ortho>a:cháj</ortho><pos>n</pos><sense><def><en>axe</en><es>hacha</es></def></sense></entry>

    num received: 3
    I got <entry><form>aːčáx</form><ortho>a:cháj</ortho><pos>n</pos><sense><def><en>axe</en><es>hacha</es></def></sense></entry>

    continue? 

The question

Now, I've checked, and elem isn't defined prior to the loop starting. And no, there aren't 2 identical elements at the start of the file. After the first "I received" bit, it seems to be working the way I'd expect - things are yielded and then received (e.g. a:cháj "axe" is yielded first, then received).

Even more oddly, the first element is processed before being yielded - without being cleared at the end of the loop. The first time it's "received", it has no <sgmtd> node. When it's "yielded" for the first time, it has a <sgmtd> node, indicating it's been processed. Then it's received again, and (despite a line saying if not elem.find("sgmtd"): elem.insert(2, segmented_form)) a second <sgmtd> node is added and written out to the file. The output file winds up with:

    <?xml version="1.0" encoding="utf-8"?>
    <lexicon>
    <entry><form>aː</form><ortho>a:</ortho><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>
    <entry><form>aː</form><ortho>a:</ortho><sgmtd /><sgmtd /><pos>dcadv</pos><sense><def><en>over here</en><es>acá</es></def></sense></entry>

So what am I misunderstanding here? How can an item be "received" from a generator function without any of the code prior to the yield statement being executed?

It turns out that changing the if not elem.find("sgmtd") line to if elem.find("sgmtd") is None stops the duplicate item from being processed. I guess Element objects don't implicitly convert to True the way I expected. I'd still like to know why it showed up, though!
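A minimal sketch of why the `is None` check behaves differently, assuming a hypothetical one-entry document. In the ElementTree of that era an element's truth value followed its number of children, so an empty <sgmtd /> that was found still tested as false. Newer Python versions deprecate truth-testing elements altogether, so the sketch inspects `len()` instead of calling `bool()`.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry that already contains an empty <sgmtd /> node.
entry = ET.fromstring("<entry><form>a</form><sgmtd /></entry>")

found = entry.find("sgmtd")     # an Element, but one with no children
missing = entry.find("nope")    # find() returns None when nothing matches

found_is_present = found is not None   # the reliable "was it found?" test
found_child_count = len(found)         # 0, which is why "not found" looked true
missing_is_none = missing is None

print(found_is_present, found_child_count, missing_is_none)
```

Comparing against None asks "did find() return an element?", which is the question the original condition was meant to ask.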

Both @Chad Miller and @Jochen Ritzel pointed out that I wasn't counting the root element I was yielding. That was intentional - I thought the generator function would never reset, in the same way that generator objects don't. When I started the loop for elem in get_entries(dictionary), I figured the root element had already been consumed.

However, if I add a print statement before yielding the root element, it gets printed twice. The duplication of data I was seeing was caused by elem.insert(2, segmented_form) being called on the root, as segmented_form involves using elem.find (thus searching the children) and grabbing the first element of the tree.

So: the reason I was seeing duplicates is that generator functions don't behave the same way as generator objects. Each call to get_entries(dictionary) builds a new generator that starts again from the top of the function, while still drawing from the same underlying iterator. Lesson learned!
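The difference can be reproduced without any XML. This is a minimal sketch, assuming a plain list of strings as a hypothetical stand-in for the iterparse events; the function name matches the one above, but the body is simplified for illustration.

```python
def get_entries(source):
    # This body runs from the top for every new generator object.
    first = next(source)
    yield ("head", first)
    for item in source:
        yield ("item", item)

# Calling the generator function twice on one shared iterator.
shared = iter(["root", "entry1", "entry2"])
head = next(get_entries(shared))   # first generator consumes "root"
rest = list(get_entries(shared))   # second generator restarts and takes "entry1" as its head

# Creating the generator object once and reusing it.
shared_again = iter(["root", "entry1", "entry2"])
entries = get_entries(shared_again)
head_once = next(entries)
rest_once = list(entries)

print(head, rest)
print(head_once, rest_once)
```

In the first variant "entry1" comes back through the `yield` meant for the root, which mirrors the extra "received" item in the output above. Keeping a single generator object, as in the second variant, is one way to avoid it.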

