ruby - Extract text between <br > tags
I am able to extract URLs using the following:

    html = open('http://lab/links.html')
    urls = URI.extract(html)

This works great.
Now I need to extract a list of URLs that have no http or https prefix and sit between <br > tags. Since there is no http or https prefix, URI.extract doesn't work.
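A minimal sketch of why that happens, assuming Ruby's standard `uri` library: URI.extract only matches strings that start with a scheme, so the unprefixed entry is skipped. The helper name `extract_prefixed` and the sample string are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper: returns only the http/https URLs found in the text.
# URI.extract looks for "scheme:..." patterns, so scheme-less strings such as
# "domain1.com/index.html" are never matched.
def extract_prefixed(text)
  URI.extract(text, %w[http https])
end

p extract_prefixed("http://lab/links.html domain1.com/index.html")
```

Note that URI.extract is marked obsolete in current Ruby versions, although it is still available.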
    domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php

Each unprefixed URL is between <br > tags.
I have been looking at "Nokogiri XPath to retrieve text after <br> within <td> and <span>" but couldn't get it to work.
Output

    domain1.com/index.html
    domain2.com/home/~john/index.html
    domain3.com/a/b/c/d/index.php

Intermediate solution
    doc = Nokogiri::HTML(open("http://lab/noprefix_domains.html"))
    # Replace every <br> node with a newline
    doc.search('br').each do |n|
      n.replace("\n")
    end
    puts doc

I still need to strip out the rest of the HTML tags (!doctype, html, body, p)...
Solution

    str = ""
    # Keep only the text nodes and the <br> nodes, dropping all other markup
    doc.traverse { |n| str << n.to_s if (n.name == "text" or n.name == "br") }
    puts str.split(/\s*<\s*br\s*>\s*/)

Thanks.
Assuming you have a method to extract the example string you showed in the question, you can use split on the string:
    str = "domain1.com/index.html<br >domain2.com/home/~john/index.html<br >domain3.com/a/b/c/d/index.php"
    str.split /\s*<\s*br\s*>\s*/
    #=> ["domain1.com/index.html",
    #    "domain2.com/home/~john/index.html",
    #    "domain3.com/a/b/c/d/index.php"]

This will split the string at every <br> tag. It will also remove whitespace before and after the <br>, and allow whitespace inside the <br> tag, e.g. <br > or < br>. If you need to handle self-closing tags as well (e.g. <br />), use this regex instead:
/\s*<\s*br\s*\/?\s*>\s*/
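As a rough sketch of how the self-closing variant could be used, the helper below wraps the split in a method. The names `BR_TAG` and `split_on_br` are hypothetical, and the `i` flag (to also match <BR>) and the removal of empty pieces are additions that the answer itself does not mention.

```ruby
# Hypothetical constant: the regex from the answer plus a case-insensitive
# flag, which is an assumption added here so that <BR> is matched too.
BR_TAG = /\s*<\s*br\s*\/?\s*>\s*/i

# Hypothetical helper: split text on <br> tags and drop empty pieces,
# such as the one produced when the text starts with a <br>.
def split_on_br(text)
  text.split(BR_TAG).reject(&:empty?)
end

sample = "domain1.com/index.html<br />domain2.com/home/~john/index.html< BR >domain3.com/a/b/c/d/index.php"
puts split_on_br(sample)
```

If the markup is more irregular than this (attributes on the tag, for example), walking the parsed document as in the question's own solution is probably the safer route.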