python - Parse Sitemap Quickly
I have 30 sitemap files in the format below:
<?xml version="1.0" encoding="utf-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.a.com/a</loc>
<lastmod>2013-08-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>
<url>
<loc>http://www.a.com/b</loc>
<lastmod>2013-08-01</lastmod>
<changefreq>weekly</changefreq>
<priority>0.6</priority>
</url>
...
</urlset>
I want the output as 4 columns per row, one row per url tag, printed to the screen:
http://www.a.com/a 2013-08-01 weekly 0.6
http://www.a.com/b 2013-08-01 weekly 0.6
My current approach uses Python with BeautifulSoup to parse the tags out, but performance is horribly slow since there are 30+ files with 300,000 lines per file. I'm wondering whether it's possible to use shell tools like awk or sed instead, or whether I'm simply using the wrong tools.
Since the sitemaps are consistently formatted, there might be regular-expression tricks that would work here.
Does anyone have experience making awk or sed treat records/rows as spanning multiple lines, instead of being delimited by the newline character?
Thanks a lot!
I wouldn't suggest regular expressions as a general way of parsing arbitrary XML or HTML, but since you said the input is well-formed, the usual warning can be ignored in this case:
sed -n '/^<url>$/{N;N;N;N;s/\n/ /g;s/ *<[a-z]*>//g;s/<\/[a-z]*>/ /g;p}'
Here is a commented version that explains what is going on:
sed -n '/^<url>$/ {    # if the line contains only <url>
    N;N;N;N            # append the next 4 lines to the pattern space
    s/\n/ /g           # replace the embedded newlines with spaces
    s/ *<[a-z]*>//g    # remove opening tags and the spaces before them
    s/<\/[a-z]*>/ /g   # replace closing tags with a space
    p                  # print the pattern space
}' test.txt
The -n option suppresses the automatic printing of the pattern space.