python - Parse Sitemap Quickly


I have 30 sitemap files in the format below:

<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.a.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.a.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
...
</urlset>

I want the output to have 4 columns per row, one row per url tag, printed to the screen:

http://www.a.com/a 2013-08-01 weekly 0.6
http://www.a.com/b 2013-08-01 weekly 0.6

The way I'm doing it now is with Python and BeautifulSoup to parse the tags out; however, the performance is horribly slow, since there are 30+ files with 300,000 lines per file. I'm wondering whether it's possible to use shell tools like awk or sed, or whether I'm simply using the wrong tool for this.
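Before switching tools entirely, it's worth noting that the bottleneck is usually BeautifulSoup itself rather than Python: the standard library's `xml.etree.ElementTree.iterparse` streams the document instead of building a full soup, and is typically much faster. A minimal sketch, run here via a shell heredoc; the inline sample stands in for one of the 30 files:

```shell
python3 - <<'PY'
import io
import xml.etree.ElementTree as ET

# Inline sample standing in for one of the 30 sitemap files.
sample = io.StringIO('''<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.a.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
</urlset>''')

ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
for event, elem in ET.iterparse(sample):      # default: fires on end tags
    if elem.tag == ns + 'url':
        # children arrive in document order: loc, lastmod, changefreq, priority
        print(' '.join(child.text for child in elem))
        elem.clear()                          # free the element as we go
PY
```

For real files, replace the `StringIO` with an open file handle; `elem.clear()` keeps memory flat even on 300,000-line inputs.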

Since the sitemap is consistently formatted, there might be regular-expression tricks for it.

Does anyone have experience dividing records/rows in awk or sed across multiple lines instead of on the newline character?
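For the record-splitting part specifically: GNU awk (and mawk) treat a multi-character `RS` as a regular expression, so each `<url>...</url>` block becomes one record no matter how many lines it spans. POSIX awk only honors the first character of `RS`, so this sketch assumes gawk or mawk; the file name `sample.xml` is just for illustration:

```shell
# Build a small sample file (hypothetical name sample.xml).
cat > sample.xml <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
    <loc>http://www.a.com/a</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
<url>
    <loc>http://www.a.com/b</loc>
    <lastmod>2013-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.6</priority>
</url>
</urlset>
EOF

# Each </url> ends a record; records without <loc> (the trailing
# </urlset> fragment) are skipped.
awk -v RS='</url>' '/<loc>/ {
    gsub(/<[^>]*>/, "")          # strip every tag, opening and closing
    gsub(/[[:space:]]+/, " ")    # collapse newlines and indentation
    gsub(/^ | $/, "")            # trim leading/trailing space
    print
}' sample.xml
```

This prints one `loc lastmod changefreq priority` row per url block and, unlike a line-anchored approach, is insensitive to indentation.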

Thanks a lot!

I wouldn't suggest regular expressions as a general way of parsing arbitrary XML or HTML, but since you said it's well formatted, the usual warning can be ignored in this case:

sed -n '/^<url>$/{N;N;N;N;s/\n/ /g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}'

Here is a commented version that explains what's going on:

sed -n '/^<url>$/ {    # if the line is exactly <url>
  N;N;N;N              # append the next 4 lines to the pattern space
  s/\n/ /g             # replace the newlines with spaces
  s/ *<[a-z]*>//g      # remove opening tags and the spaces before them
  s/<\/[a-z]*>/ /g     # replace closing tags with a space
  p                    # print the pattern space
}' test.txt

The -n option suppresses the automatic printing of the pattern space, so only the lines we explicitly p-rint come out.
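As a quick end-to-end check, the script (using N to append lines into the pattern space) can be run against a throwaway file; the name `sitemap1.xml` is just for illustration, and a glob like `sitemap*.xml` would feed all 30 files to a single invocation:

```shell
# Build a one-record sample file (hypothetical name sitemap1.xml).
printf '%s\n' \
  '<url>' \
  '    <loc>http://www.a.com/a</loc>' \
  '    <lastmod>2013-08-01</lastmod>' \
  '    <changefreq>weekly</changefreq>' \
  '    <priority>0.6</priority>' \
  '</url>' > sitemap1.xml

# sed streams its input, so memory use stays flat however large the
# files are; prints the four fields on one line (with a trailing
# space left by the last closing-tag substitution).
sed -n '/^<url>$/{N;N;N;N;s/\n/ /g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}' sitemap1.xml
```

Note that `/^<url>$/` anchors on a line containing exactly `<url>`, so this relies on the tag not being indented, unlike the awk record-separator approach.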

